Abstract

The feedback capacity of additive stationary Gaussian noise channels is characterized as the solution 
to a variational problem. Toward this end, it is proved that the optimal feedback coding scheme 
is stationary. When specialized to the first-order autoregressive moving average noise spectrum, this 
variational characterization yields a closed-form expression for the feedback capacity. In particular, this 
result shows that the celebrated Schalkwijk-Kailath coding scheme achieves the feedback capacity for 
the first-order autoregressive moving average Gaussian channel, positively answering a long-standing 
open problem studied by Butman, Schalkwijk-Tiernan, Wolfowitz, Ozarow, Ordentlich, Yang-Kavcic-Tatikonda,
and others. More generally, it is shown that a $k$-dimensional generalization of the Schalkwijk-Kailath
coding scheme achieves the feedback capacity for any autoregressive moving average noise
spectrum of order k. Simply put, the optimal transmitter iteratively refines the receiver's knowledge of 
the intended message. 

I. Introduction 

We consider a communication scenario in which one wishes to communicate a message index
$W \in \{1, \ldots, 2^{nR}\}$ over the additive Gaussian noise channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$,
where the additive Gaussian noise process $\{Z_i\}_{i=1}^{\infty}$ is stationary with
$Z^n = (Z_1, \ldots, Z_n) \sim N_n(0, K_{Z,n})$ for each $n = 1, 2, \ldots$. For block length $n$, we specify a
$(2^{nR}, n)$ feedback code with codewords
$X^n(W, Y^{n-1}) = (X_1(W), X_2(W, Y_1), \ldots, X_n(W, Y^{n-1}))$, $W = 1, \ldots, 2^{nR}$, satisfying the
average power constraint

$$\frac{1}{n} \sum_{i=1}^{n} E X_i^2(W, Y^{i-1}) \le P$$

and decoding function $\hat{W}_n : \mathbb{R}^n \to \{1, \ldots, 2^{nR}\}$. The probability of error
$P_e^{(n)}$ is defined as

$$P_e^{(n)} = \frac{1}{2^{nR}} \sum_{w=1}^{2^{nR}} \Pr\{\hat{W}_n(Y^n) \ne w \mid W = w\} = \Pr\{\hat{W}_n(Y^n) \ne W\}$$

where the message $W$ is uniformly distributed over $\{1, 2, \ldots, 2^{nR}\}$ and is independent of $Z^n$. We say
that the rate $R$ is achievable if there exists a sequence of $(2^{nR}, n)$ codes with $P_e^{(n)} \to 0$ as $n \to \infty$.
The feedback capacity $C_{FB}$ is defined as the supremum of all achievable rates. We also consider the
case in which there is no feedback, corresponding to the codewords $X^n(W) = (X_1(W), \ldots, X_n(W))$

independent of the previous channel outputs. We define the nonfeedback capacity C, or the capacity in 
short, in a manner similar to the feedback case.

It is well known that the nonfeedback capacity is characterized by water-filling on the noise spectrum,
which is arguably one of the most beautiful results in information theory. More specifically, the capacity
$C$ of the additive Gaussian noise channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, under the power constraint $P$, is
given by

$$C = \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{\max\{S_Z(e^{i\theta}), \lambda\}}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi} \qquad (1)$$

where $S_Z(e^{i\theta})$ is the power spectral density of the stationary noise process $\{Z_i\}_{i=1}^{\infty}$, i.e., the
Radon-Nikodym derivative of the spectral distribution of $\{Z_i\}_{i=1}^{\infty}$ (with respect to Lebesgue measure),
and the water level $\lambda$ is chosen to satisfy

$$P = \int_{-\pi}^{\pi} \max\{0, \lambda - S_Z(e^{i\theta})\} \, \frac{d\theta}{2\pi}. \qquad (2)$$

Although (1) and (2) give only a parametric characterization of the capacity $C(\lambda)$ under the power
constraint $P(\lambda)$ for each parameter $\lambda > 0$, this solution is considered simple and elegant enough to
be called closed-form. Just like many other fundamental developments in information theory, the idea
of water-filling comes from Shannon [80], although it is sometimes attributed to Holsinger [31] or
Ebert [18].
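The parametric solution (1)-(2) is easy to evaluate numerically. The following is a minimal sketch, assuming the noise spectrum is sampled on a uniform grid over $[-\pi, \pi)$ so that grid averages approximate integrals against $d\theta/2\pi$; the function name `water_filling`, the bisection on the water level, and the moving average example spectrum are illustrative choices, not part of the original development.

```python
import numpy as np

def water_filling(noise_psd, power, iters=200):
    """Water-filling over samples of S_Z on a uniform grid of [-pi, pi).

    Returns (capacity in nats per channel use, water level lambda).
    Grid averages stand in for the integrals with respect to d(theta)/(2*pi).
    """
    s = np.asarray(noise_psd, dtype=float)
    lo, hi = 0.0, s.max() + power      # the power used at level hi is at least `power`
    for _ in range(iters):             # bisection on the water level, cf. (2)
        mid = 0.5 * (lo + hi)
        if np.mean(np.maximum(0.0, mid - s)) < power:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    capacity = np.mean(0.5 * np.log(np.maximum(s, lam) / s))   # cf. (1)
    return capacity, lam

theta = np.linspace(-np.pi, np.pi, 4096, endpoint=False)

# White noise with unit variance: the formula reduces to 0.5 * log(1 + P).
c_white, lam_white = water_filling(np.ones_like(theta), power=1.0)

# First-order moving average noise Z_i = U_i + 0.5 U_{i-1} (illustrative choice).
s_ma1 = np.abs(1.0 + 0.5 * np.exp(1j * theta)) ** 2
c_ma1, lam_ma1 = water_filling(s_ma1, power=1.0)
print(c_white, lam_white, c_ma1, lam_ma1)
```

For white noise the sketch recovers the familiar $\frac{1}{2}\log(1+P)$, which serves as a sanity check on the discretization.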

For the case of feedback, no such elegant solution exists. Most notably, Cover and Pombra [13]
characterized the $n$-block feedback capacity $C_{FB,n}$ for arbitrary time-varying Gaussian channels via the
asymptotic equipartition property (AEP) for arbitrary nonstationary nonergodic Gaussian processes as

$$C_{FB,n} = \max_{K_{V,n}, B_n} \frac{1}{2} \log \frac{\det(K_{V,n} + (B_n + I_n) K_{Z,n} (B_n + I_n)')^{1/n}}{\det(K_{Z,n})^{1/n}} \qquad (3)$$

where the maximum is taken over all positive semidefinite matrices $K_{V,n}$ and all strictly lower triangular
$B_n$ of size $n \times n$ satisfying $\mathrm{tr}(K_{V,n} + B_n K_{Z,n} B_n') \le nP$. When specialized to a stationary noise
process, the Cover-Pombra characterization gives the feedback capacity as a limiting expression

$$C_{FB} = \lim_{n \to \infty} C_{FB,n} = \lim_{n \to \infty} \max_{K_{V,n}, B_n} \frac{1}{2} \log \frac{\det(K_{V,n} + (B_n + I_n) K_{Z,n} (B_n + I_n)')^{1/n}}{\det(K_{Z,n})^{1/n}}. \qquad (4)$$

Despite its generality, the Cover-Pombra formulation of the feedback capacity falls short of what we
can call a closed-form solution. It is very difficult, if not impossible, to obtain an analytic expression
for the optimal $(K_{V,n}^\star, B_n^\star)$ in (3) for each $n$. Furthermore, the sequence of optimal
$\{K_{V,n}^\star, B_n^\star\}_{n=1}^{\infty}$ is not necessarily consistent, that is, $(K_{V,n}^\star, B_n^\star)$ is not necessarily a
subblock of $(K_{V,n+1}^\star, B_{n+1}^\star)$. Hence the Cover-Pombra characterization in itself does not give much
hint on the structure of the optimal $\{K_{V,n}^\star, B_n^\star\}_{n=1}^{\infty}$ achieving $C_{FB,n}$, or more importantly, its
limiting behavior.

In this paper, we make one step forward by first characterizing the Gaussian feedback capacity $C_{FB}$
in Theorem IV.1 as

$$C_{FB} = \sup_{S_V, B} \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi} \qquad (5)$$

where $S_Z(e^{i\theta})$ is the power spectral density of the noise process $\{Z_i\}_{i=1}^{\infty}$ and the supremum is taken
over all power spectral densities $S_V(e^{i\theta}) \ge 0$ and all strictly causal finite impulse response filters
$B(e^{i\theta}) = \sum_{k=1}^{m} b_k e^{ik\theta}$ satisfying the power constraint
$\int_{-\pi}^{\pi} (S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta})) \, \frac{d\theta}{2\pi} \le P$. Roughly
speaking, this characterization shows the asymptotic optimality of the stationary solution $(K_{V,n}^\star, B_n^\star)$ in
(3) and hence it can be viewed as the justification for interchange of the order of limit and maximum
in (4).

Since our characterization is in a variational form, we will subsequently find in Propositions V.1
and V.3 necessary and sufficient conditions for the optimal $(S_V^\star(e^{i\theta}), B^\star(e^{i\theta}))$ from Lagrange duality
theory and additional information theoretic arguments. This result, when specialized to the first-order
autoregressive (AR) noise spectrum $S_Z(e^{i\theta}) = |1 + \beta e^{i\theta}|^{-2}$, $-1 < \beta < 1$, yields a closed-form expression
for feedback capacity as

$$C_{FB} = -\log x_0$$

where $x_0$ is the unique positive root of the fourth-order polynomial

$$P x^2 = \frac{1 - x^2}{(1 + |\beta| x)^2},$$

establishing the long-standing conjecture by Butman [7], [8], Tiernan-Schalkwijk [91], [90], and
Wolfowitz [97]. In fact, we will obtain an explicit feedback capacity formula for the first-order autoregressive
moving average (ARMA) noise spectrum in Theorem VI.1, which generalizes the result in [45] and
confirms a recent conjecture by Yang, Kavcic, and Tatikonda [105]. As we will see later, our result
shows that the celebrated Schalkwijk-Kailath coding scheme [76], [77] achieves the feedback capacity.
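The closed-form expression for the first-order autoregressive spectrum can be evaluated with a few lines of code. The sketch below is only an illustration under the stated formula $P x^2 = (1 - x^2)/(1 + |\beta| x)^2$; the function name `ar1_feedback_capacity` and the use of bisection are assumptions made here. The left-hand side minus the right-hand side, after clearing the denominator, is increasing in $x > 0$, negative at $x = 0$, and positive at $x = 1$, so bisection on $(0, 1)$ finds the unique positive root.

```python
import math

def ar1_feedback_capacity(power, beta, iters=200):
    """Return (C_FB in nats, x0) for the AR(1) spectrum |1 + beta e^{i theta}|^{-2}.

    x0 is the unique positive root of P x^2 (1 + |beta| x)^2 = 1 - x^2,
    located by bisection on (0, 1). Requires power > 0 and |beta| < 1.
    """
    b = abs(beta)

    def g(x):
        # Increasing in x > 0, with g(0) = -1 < 0 and g(1) = P (1 + b)^2 > 0.
        return power * x**2 * (1.0 + b * x) ** 2 + x**2 - 1.0

    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    x0 = 0.5 * (lo + hi)
    return -math.log(x0), x0

c_white, x_white = ar1_feedback_capacity(power=1.0, beta=0.0)
c_ar, x_ar = ar1_feedback_capacity(power=1.0, beta=0.5)
print(c_white, c_ar)
```

With $\beta = 0$ the root is $x_0 = 1/\sqrt{1+P}$, so the value reduces to the white Gaussian capacity $\frac{1}{2}\log(1+P)$.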

More generally, we will show in Theorem VII.1 that a $k$-dimensional generalization of the
Schalkwijk-Kailath coding scheme achieves the feedback capacity for any autoregressive moving average noise
spectrum of order $k$.

The literature on Gaussian feedback capacity is vast. Instead of trying to be complete, we sample the 
results that are closely related to our discussion. A more complete survey can be found in [45]. The 
standard literature on the Gaussian feedback channel and associated simple feedback coding schemes 
traces back to Elias's 1956 paper [21] and its sequels [28], [22]. Schalkwijk and Kailath [76], [77] made 
a major breakthrough by showing that a simple linear feedback coding scheme achieves the feedback 
capacity of the additive white Gaussian noise channel with doubly exponentially decreasing probability 
of decoding error. More specifically, the transmitter sends a real-valued information bearing signal at the 
beginning of communication and subsequently refines the receiver's knowledge by sending the error of 
the receiver's estimate of the message. This simple coding scheme, or no coding in a sense, achieves the 
capacity of the Gaussian channel and the resulting error probability of the maximum likelihood decoding 
decays doubly-exponentially in the duration of the communication. This fascinating result has been 
extended in many directions, for example, by Pinsker [70], Omura [61], Wyner [99], Schalkwijk [78], 
Kramer [48], Zigangirov [109], Schalkwijk and Barron [79], and Ozarow and Leung-Yan-Cheong [67],
[64]. 

Following these results on the white Gaussian noise channel, the focus naturally shifted to the
feedback capacity of the nonwhite Gaussian noise channel. Butman [7], [8] extended the
Schalkwijk-Kailath coding scheme to autoregressive noise channels. Subsequently, Tiernan and Schalkwijk [91],
[90], Wolfowitz [97], and Ozarow [65], [66] studied the feedback capacity of finite-order autoregressive
moving average additive Gaussian noise channels and obtained many interesting upper and lower bounds.
Recently, Yang, Kavcic, and Tatikonda [105] (see also Yang's thesis [104]) revived the control-theoretic
approach (cf. Omura [61], Tiernan and Schalkwijk [91]) to the finite-order autoregressive moving average
Gaussian feedback capacity problem. After reformulating the feedback capacity problem as a stochastic
control problem, Yang et al. used dynamic programming for the numerical computation of $C_{FB,n}$ and
offered a conjecture that $C_{FB}$ can be characterized as a solution of another maximization problem, the
size of which depends only on the order of the noise process.

With a more general line of attack, Cover and Pombra [13] obtained the $n$-block capacity for the
arbitrary nonwhite Gaussian channel with or without feedback, using an AEP theorem for nonstationary
nonergodic Gaussian processes. (Recall (3) for the feedback case; the nonfeedback case corresponds to
taking $B = 0$.) They also showed that feedback does not increase the capacity much; namely, feedback
at most doubles the capacity (a result obtained by Pinsker [71] and Ebert [19]), and feedback increases
the capacity at most by half a bit. The extensions and refinements of the Cover-Pombra result abound.
Ihara obtained a coding theorem for continuous-time Gaussian channels with feedback [35], [37] and
showed that the factor-of-two bound on the feedback capacity is tight by considering cleverly constructed
nonstationary channels for both discrete [36] and continuous [33] cases. Dembo [15] studied the upper
bounds on $C_{FB,n}$ and showed that feedback does not increase the capacity at very low signal-to-noise
ratio or very high signal-to-noise ratio. (See Ozarow [65] for a minor technical condition on the result
for very low signal-to-noise ratio.) Ordentlich [62] examined the properties of the optimal solution
$(K_{V,n}, B_n)$ for $C_{FB,n}$ in (3) for a fixed $n$ and showed that the optimal $K_{V,n}$ water-fills the new noise
spectrum $(I_n + B_n) K_{Z,n} (I_n + B_n)'$ and that the optimal filter $B_n$ makes the input signal orthogonal
to the past output. Based on these two crucial observations, he also found that the optimal $K_{V,n}$ has
rank $k$ for moving average noise processes of order $k$. Yanagi and Chen [102], [10], [11] studied
Cover's conjecture [12] that the feedback capacity is at most as large as the nonfeedback capacity
under twice the power, and also made several refinements on the upper bounds by Cover and Pombra.
Recently a counterexample to Cover's conjecture was found by the author [44]. Thomas [89], Pombra
and Cover [73], and Ordentlich [63] extended the factor-of-two bound result to the colored Gaussian
multiple access channels with feedback.

Despite many developments on the nonwhite Gaussian channels, the exact characterization of the 
feedback capacity has been open, even for simple special cases. In [45], the author obtained the closed- 
form capacity formula for the special case in which the noise process has the first-order moving average 
spectrum, establishing the feedback capacity for the first time. Thanks to the special structure of the 
noise spectrum, the maximization problem in (3) can be solved analytically under the modified power
constraint on each input signal $X_i$, $i = 1, 2, \ldots$. Then, a fixed-point theorem exploiting the convexity of
the problem is deployed to show the asymptotic optimality of the uniform power allocation over time. 
This result confirms the common belief that the stationary Schalkwijk-Kailath linear coding scheme 
achieves the feedback capacity. A similar argument also shows that the uniform power allocation is 
asymptotically optimal for the Schalkwijk-Kailath coding scheme if the noise process has the first-order 
autoregressive spectrum. 

Our approach in this paper is different from the one taken in [45] and is geared towards the general 
case. As is hinted in the similarity between the Cover-Pombra characterization of the Gaussian feedback
capacity in (4) and the variational characterization (5), our development starts from the $n$-block capacity
formula (3). The variational formula (5), however, certainly has the flavor of spectral analysis, in the
context of which we will derive properties of the optimal solution $(S_V^\star(e^{i\theta}), B^\star(e^{i\theta}))$. This optimal
solution will be then linked to the asymptotic behavior of the linear coding scheme by Schalkwijk and 
Kailath, and its generalization by Butman. Thus in a sense our development goes in a full circle through 
the literature cited above. 

We will make parallel developments of both nonfeedback and feedback cases, especially because the
well-trodden nonfeedback capacity problem provides a test bed for new techniques. Hence, we revisit the
Gaussian nonfeedback capacity problem in Section III and derive the water-filling capacity formula (1)
in a rather nontraditional manner. In Section IV we go through similar steps for the feedback case to
establish (5). Naturally, we will encounter a few technical difficulties that do not arise in the nonfeedback
case. Section V deals with sufficient and necessary conditions on the optimal solution $(S_V^\star, B^\star)$ to the
variational problem (5). As a corollary of this result, we obtain the closed-form feedback capacity
formula for the first-order ARMA Gaussian channel. We will then interpret this result in the context of
the Schalkwijk-Kailath coding scheme. We will also discuss the general finite-order ARMA channels
in Section VII. The next section recalls necessary results from various branches of mathematics.



II. Mathematical Preliminaries 
A. Toeplitz Matrices, Szegő's Limit Theorem, and Entropy Rate

We first review a few important results on spectral properties of stationary Gaussian processes, which 
we will use heavily for the variational characterization of feedback capacity. 

Let $R(k) = R(-k) = E Z_i Z_{i+k}$, $k = 0, 1, 2, \ldots$, be the covariance sequence of a stationary Gaussian
process $\{Z_i\}_{i=1}^{\infty}$. Then, as the elegant answer to the classical trigonometric moment problem shows (see,
for example, Akhiezer [1] and Landau [52]), there exists a positive measure $\mu$ on $[-\pi, \pi)$, sometimes
called the power spectral distribution of the process $\{Z_i\}_{i=1}^{\infty}$, such that

$$R(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-ik\theta} \, d\mu(\theta)$$

for all $k$. From the Lebesgue decomposition theorem, we can write $\mu$ as a sum $\mu = \mu_{ac} + \mu_s$, where
$\mu_{ac}$ is absolutely continuous with respect to Lebesgue measure and $\mu_s$ is singular. The Radon-Nikodym
derivative of $\mu_{ac}$ (with respect to Lebesgue measure), called the power spectral density of $\{Z_i\}_{i=1}^{\infty}$,
exists almost everywhere and can be written as a function of $e^{i\theta}$, or more specifically, we have
$d\mu_{ac} = S(e^{i\theta}) \, d\theta = \mathrm{Re}\, F(e^{i\theta}) \, d\theta$ for some function $F(z)$ analytic on the unit disc
$\mathbb{D} = \{z \in \mathbb{C} : |z| < 1\}$ with $F(0) > 0$ and $\mathrm{Re}\, F(z) > 0$ on $\mathbb{D}$.

Conversely, given a nontrivial (i.e., supported by infinitely many points) positive measure
$d\mu = S(e^{i\theta}) \, d\theta + d\mu_s$, the Toeplitz matrix $K_n$ of size $n \times n$ given by

$$K_n(j, k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{-i(j-k)\theta} \, d\mu(\theta), \qquad 1 \le j, k \le n$$

is positive definite Hermitian. Hence, $K_n$ has $n$ positive eigenvalues $\lambda_1(K_n), \ldots, \lambda_n(K_n)$, counting
multiplicity. In his famous limit theorem [86], [87], Szegő proved an elegant relationship between the
asymptotic behavior of the eigenvalues of $K_n$ and the associated spectral distribution $\mu$. This result lies
at the heart of many different fields, including operator theory, time-series analysis, quantum mechanics,
approximation theory, and, of course, information theory. Here we recall a fairly general version of
Szegő's limit theorem, which can be found in Simon [82, Theorem 2.7.13].

Lemma II.1 (Szegő's Limit Theorem). Let $f$ be a continuous function on $[0, \infty)$ such that

$$\lim_{x \to \infty} \frac{f(x)}{x} = c < \infty.$$

Then,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} f(\lambda_i(K_n)) = \int_{-\pi}^{\pi} f(S(e^{i\theta})) \, \frac{d\theta}{2\pi} + \frac{c}{2\pi} \int_{-\pi}^{\pi} d\mu_s(\theta).$$

The above limit theorem is sometimes called the first Szegő theorem, in order to be distinguished
from the second-order asymptotics often called the strong Szegő theorem and obtained by Szegő himself
after a 38-year gap [88]. Refer to Grenander and Szegő [29, Chapter 5], Böttcher and Silbermann [5,
Chapter 5], Gray [27], and Barry Simon's recent two-part tome on orthogonal polynomials on the unit
circle [82] for different flavors of Szegő's theorem under different levels of generality.

As a canonical application of Szegő's limit theorem, the following variational statement, attributed to
Szegő, Kolmogorov [46], and Krein [49], [50], connects the entropy rate, the spectral distribution, and
the minimum mean-square prediction error of a stationary Gaussian process.

Lemma II.2 (Szegő-Kolmogorov-Krein Theorem). Let $\{Z_i\}_{i=-\infty}^{\infty}$ be a stationary Gaussian process
with a nontrivial spectral distribution $d\mu = S(e^{i\theta}) \, d\theta + d\mu_s$. Then the minimum mean-squared prediction
error $E_\infty = E(Z_0 - E(Z_0 \mid Z_{-\infty}^{-1}))^2$ of $Z_0$ from the entire past $Z_k$, $k < 0$, is given by

$$E_\infty = \inf_{\{a_k\}} \frac{1}{2\pi} \int_{-\pi}^{\pi} \Big| 1 - \sum_{k \ge 1} a_k e^{ik\theta} \Big|^2 d\mu(\theta)
= \exp\left( \int_{-\pi}^{\pi} \log S(e^{i\theta}) \, \frac{d\theta}{2\pi} \right)
= \frac{1}{2\pi e} \, e^{2h(Z)}$$

where $h(Z) = \lim_{n \to \infty} n^{-1} h(Z_1, \ldots, Z_n)$ denotes the differential entropy rate of the process $\{Z_i\}$.

The proof of this result follows almost immediately from Szegő's limit theorem with $f(x) = \log x$.
Note that the prediction error depends only on the absolutely continuous part of the spectral measure;
this is no surprise for us, since $\lim_{x \to \infty} (\log x)/x = 0$. (The fact that the prediction error is independent
of the singular part of the spectral distribution can be also proved from somewhat deeper results on
shift operators and Wold-Kolmogorov decomposition. See, for example, Nikolski [60] and references
therein.) We stress the relationship between the entropy rate of a stationary Gaussian process $\{Z_i\}$ and
its spectral density $S(e^{i\theta})$ in the following familiar expression:

$$h(Z) = \int_{-\pi}^{\pi} \frac{1}{2} \log(2\pi e \, S(e^{i\theta})) \, \frac{d\theta}{2\pi}. \qquad (6)$$
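The link between the spectral density, the prediction error, and the entropy rate can be checked numerically on a simple example. The sketch below assumes a first-order moving average process $Z_i = U_i + a U_{i-1}$ with $a = 0.5$ (an illustrative choice), for which $S(e^{i\theta}) = |1 + a e^{i\theta}|^2$ and the prediction error from the infinite past equals 1. The helper names `toeplitz_cov` and `finite_past_prediction_error` are hypothetical; the finite-past error is computed as the ratio of consecutive Toeplitz determinants.

```python
import numpy as np

def toeplitz_cov(r, n):
    """n x n Toeplitz covariance with entries R(|j - k|); lags beyond len(r) are zero."""
    r_full = np.zeros(n)
    m = min(len(r), n)
    r_full[:m] = r[:m]
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return r_full[idx]

def finite_past_prediction_error(r, n):
    """MMSE of predicting one sample from the n previous ones: det(K_{n+1}) / det(K_n)."""
    _, logdet_next = np.linalg.slogdet(toeplitz_cov(r, n + 1))
    _, logdet_curr = np.linalg.slogdet(toeplitz_cov(r, n))
    return np.exp(logdet_next - logdet_curr)

a = 0.5
r = np.array([1.0 + a**2, a])          # R(0), R(1) of Z_i = U_i + a U_{i-1}
theta = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
S = np.abs(1.0 + a * np.exp(1j * theta)) ** 2

spectral_error = np.exp(np.mean(np.log(S)))                    # exp of the log-integral of S
entropy_rate = np.mean(0.5 * np.log(2 * np.pi * np.e * S))     # cf. (6)
matrix_error = finite_past_prediction_error(r, 40)
print(spectral_error, matrix_error, entropy_rate)
```

Both error estimates agree with the innovation variance of the driving white process, and the entropy rate matches that of a unit-variance white Gaussian process.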

Throughout this paper, in order to exclude the trivial case of unbounded capacity, we will assume
that the power spectral distribution $\mu$ of the additive Gaussian noise process $\{Z_i\}_{i=1}^{\infty}$ is nontrivial
(equivalently, $K_n$ is positive definite for all $n$), and that the power spectral density $S_Z(e^{i\theta})$ satisfies
the so-called Paley-Wiener condition:

$$\int_{-\pi}^{\pi} |\log S_Z(e^{i\theta})| \, \frac{d\theta}{2\pi} < \infty, \qquad (7)$$

which is equivalent to having prediction error $E_\infty > 0$. Unless noted otherwise, we will also assume
that the power spectral distribution $\mu$ of the noise process has an absolutely continuous part only, i.e.,
$\mu_s = 0$, which is justified in part by the Szegő-Kolmogorov-Krein theorem (i.e., we can filter out the
deterministic part of the noise to arbitrary accuracy by sending a pilot sequence) and in part by physical
reality (i.e., the mathematical model of the singular noise spectrum may have no counterpart in physical
communication systems; see, for example, Slepian's Shannon Lecture [85]).

B. Hardy Spaces, Causality, and Spectral Factorization

We review some elementary results on Hardy spaces (see, for example, Duren [17], Koosis [47], 
Rudin [75, Chapter 17]) that are needed for analysis of optimal feedback filters. Our exposition loosely 
follows two monographs by Partington [68], [69]. 

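Let $f(z) = \sum_{n=0}^{\infty} a_n z^n$ be an analytic function on $\mathbb{D} = \{z \in \mathbb{C} : |z| < 1\}$. We say that $f(z)$
belongs to the class $H_p$, $1 \le p < \infty$, if

$$\left( \int_{-\pi}^{\pi} |f(re^{i\theta})|^p \, \frac{d\theta}{2\pi} \right)^{1/p}$$

is bounded for all $r < 1$, in which case $\|f\|_{H_p}$ denotes its supremum over $r < 1$. Similarly we say that
$f(z)$ belongs to the class $H_\infty$ if

$$\|f\|_{H_\infty} = \sup_{|z| < 1} |f(z)|$$

is bounded. We can easily check that $H_p$ is a Banach space for $1 \le p \le \infty$.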

It is well-known that $f \in H_p$ can be extended to $\mathbb{T} = \partial\mathbb{D} = \{z \in \mathbb{C} : |z| = 1\}$ by taking the
pointwise radial limit

$$\tilde{f}(e^{i\theta}) = \lim_{r \uparrow 1} f(re^{i\theta})$$

which exists for almost all $\theta$. The extended function $\tilde{f}$ belongs to the standard Lebesgue space $L_p$ on
$\mathbb{R}/[-\pi, \pi) \simeq \mathbb{T}$ with the same norm $\|\tilde{f}\|_{L_p} = \|f\|_{H_p}$, so that we can consider $H_p$ as a closed (and thus
complete) subspace of $L_p$. Therefore, we will identify $f \in H_p$ with its radial extension $\tilde{f} \in L_p$ and
use the same symbol $f$ for both $f$ and $\tilde{f}$ throughout. More specifically, when we say that a function
$f(e^{i\theta})$ for $\theta \in [-\pi, \pi)$ belongs to $H_p$, we implicitly mean that $f(z)$ is also well-defined and analytic
on $\mathbb{D}$. Also we will use $f(z)$ and $f(e^{i\theta})$ interchangeably if the context is clear. Recall the following set
inclusion relationship between important classes of functions on $\mathbb{T}$:

$$H_p \subset L_p, \quad 1 \le p \le \infty,$$
$$H_\infty \subset H_2 \subset H_1,$$

and

$$L_\infty \subset L_2 \subset L_1.$$

Let $f(e^{i\theta}) \in L_p$, $1 \le p \le \infty$. We say that $f$ is causal if its Fourier coefficients

$$c_n = \int_{-\pi}^{\pi} f(e^{i\theta}) e^{-in\theta} \, \frac{d\theta}{2\pi}, \qquad n = 0, \pm 1, \pm 2, \ldots,$$

satisfy $c_n = 0$ for $n < 0$. We also say that $f$ is strictly causal if $c_n = 0$ for $n \le 0$, or equivalently,
$f(z) = z g(z)$ for some causal $g \in L_p$. By reversing the direction of the time index, we also define
anticausality and strict anticausality in a similar way.
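These definitions can be illustrated by computing Fourier coefficients on a grid. The sketch below assumes the illustrative function $f(z) = 1/(1 - z/2)$, whose coefficients are $c_n = 2^{-n}$ for $n \ge 0$ and zero otherwise, and its strictly causal counterpart $z f(z)$; the helper name `fourier_coefficient` is hypothetical and the integral is approximated by a uniform Riemann sum.

```python
import numpy as np

def fourier_coefficient(f_samples, theta, n):
    """c_n = integral of f(e^{i theta}) e^{-i n theta} d(theta)/(2 pi), by a uniform Riemann sum."""
    return np.mean(f_samples * np.exp(-1j * n * theta))

theta = np.linspace(-np.pi, np.pi, 1024, endpoint=False)
z = np.exp(1j * theta)

f = 1.0 / (1.0 - 0.5 * z)       # causal: f(z) = sum_{n>=0} 0.5^n z^n
g = z * f                        # strictly causal: g(z) = z f(z)

coeffs_f = {n: fourier_coefficient(f, theta, n) for n in range(-3, 4)}
coeffs_g = {n: fourier_coefficient(g, theta, n) for n in range(-3, 4)}
for n in range(-3, 4):
    print(n, np.round(coeffs_f[n], 6), np.round(coeffs_g[n], 6))
```

The coefficients of $f$ vanish for negative indices, while those of $z f(z)$ also vanish at index zero, as the definition of strict causality requires.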

If $f \in H_p$, then $f$ can be easily shown to be causal. (See Lemma II.3 below.) Conversely, if $f \in L_p$
is causal, then $\sup_n |c_n| < \infty$ so that $f$ is analytic on $\mathbb{D}$ with

$$f(z) = \sum_{n=0}^{\infty} c_n z^n, \qquad (8)$$

where the series on the right-hand side converges pointwise on $\mathbb{D}$. Therefore, we can identify the class
$H_p$ with the class of causal $L_p$ functions, which gives an alternative definition of the $H_p$ space.

When $f \in H_\infty$, we have the pointwise convergence of the infinite series in (8) on
$\mathbb{T} = \{e^{i\theta} : \theta \in [-\pi, \pi)\}$ for almost all $\theta$. Hence, $f \in H_\infty$ preserves the causality when acting on $L_1$ by
multiplication. For later use, we stress this simple fact in the following statement, the proof of which
easily follows from the dominated convergence theorem.

Lemma II.3. Let $f \in H_\infty$ and let $g \in L_1$ be causal. Then, $fg \in L_1$ is causal. If, in addition, $f$ is
strictly causal, then $fg \in L_1$ is strictly causal and

$$\int_{-\pi}^{\pi} f(e^{i\theta}) g(e^{i\theta}) \, \frac{d\theta}{2\pi} = 0.$$
We recall a few important factorization theorems. The first set of results deals with the factorization
of $H_p$ functions. Suppose $f(e^{i\theta}) \in H_p$, $1 \le p \le \infty$, is not identically zero. Then, $f$ has a factorization
$f(z) = g(z)u(z)$ that is unique up to a constant of modulus 1, where $g(z)$ is an inner function (i.e.,
$g(z)$ is an $H_\infty$ function with $|g(e^{i\theta})| = 1$ almost everywhere) and $u(z)$ is an $H_p$ outer function given by

$$u(z) = \exp\left( \int_{-\pi}^{\pi} \frac{e^{i\theta} + z}{e^{i\theta} - z} \log|f(e^{i\theta})| \, \frac{d\theta}{2\pi} \right).$$

Consequently, the zeros of $f$ (inside the unit circle) coincide with the zeros of $g$, and $\|f\|_p = \|u\|_p$.
We define the (infinite) Blaschke product $b(z)$ formed with the zeros of $f(z)$ as

$$b(z) = z^k \prod_{z_n \ne 0} \frac{|z_n|}{z_n} \, \frac{z_n - z}{1 - \bar{z}_n z}$$

where $\{z_n\}$ are the zeros of $f$, listed according to their multiplicity, $k$ of them being at 0. It is easy to
check that $b(z)$ is well-defined in the sense that $b(z)$ converges uniformly on compact sets to an $H_\infty$
function. Also, $|b(z)| \le 1$ on $\mathbb{D}$ and $|b(e^{i\theta})| = 1$ almost everywhere. As a refinement of the above inner-outer
factorization theorem, F. Riesz showed that $f$ has a factorization $f(z) = b(z)s(z)u(z)$ that is unique
up to a constant of modulus 1, where $b$ is the Blaschke product of the zeros of $f$, $s$ is a singular inner
function (without zeros), and $u$ is an outer function. Again $\|f\|_p = \|u\|_p$.

For our purposes, it is more convenient to introduce a normalized variant of the Blaschke product as

$$\tilde{b}(z) = z^k \prod_{z_n \ne 0} \frac{1 - z_n^{-1} z}{1 - \bar{z}_n z}.$$

Then, $|\tilde{b}(e^{i\theta})| = \prod_{z_n \ne 0} (1/|z_n|)$ almost everywhere. This normalized Blaschke product is often called
an all-pass filter in the signal processing literature if $\{z_n\}$ is finite and $k = 0$.

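If $f \in H_2$ and $f(0) = 1$, then $f$ has the unique factorization $f(z) = b(z)u(z)$ where $b(z)$ is the
normalized Blaschke product formed with zeros $\{z_n\}$ of $f$ and $u(z)$ does not have any zero inside the
unit circle. In particular, $b(0) = u(0) = 1$. Now Jensen's formula states that, if $g(z) \in H_2$ with $g(0) = 1$,
then

$$\int_{-\pi}^{\pi} \log|g(re^{i\theta})| \, \frac{d\theta}{2\pi} = \log \prod_{k=1}^{p} \frac{r}{|a_k|}$$

where $a_1, \ldots, a_p$ denote the zeros of $g(z)$ within the circle of radius $r$. Therefore,

$$\int_{-\pi}^{\pi} \log|f(e^{i\theta})| \, \frac{d\theta}{2\pi}
= \int_{-\pi}^{\pi} \log|b(e^{i\theta})| \, \frac{d\theta}{2\pi} + \int_{-\pi}^{\pi} \log|u(e^{i\theta})| \, \frac{d\theta}{2\pi}
= \log \prod_{n} \frac{1}{|z_n|}. \qquad (9)$$

As a trivial corollary, if $f$ is rational of the form

$$f(z) = \frac{P(z)}{Q(z)} = \frac{1 + \sum_{n} p_n z^n}{1 + \sum_{n} q_n z^n} = \frac{\prod_{n} (1 - \beta_n^{-1} z)}{\prod_{n} (1 - \gamma_n^{-1} z)}$$

with all zeros $\{\gamma_n\}$ of $Q(z)$ strictly outside the unit circle, then

$$\int_{-\pi}^{\pi} \log|f(e^{i\theta})| \, \frac{d\theta}{2\pi} = \log \prod_{n : |\beta_n| < 1} \frac{1}{|\beta_n|}$$

where the product runs over the zeros $\beta_n$ of $P(z)$ inside the unit circle.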
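The rational corollary of Jensen's formula lends itself to a quick numerical check. The sketch below assumes the illustrative polynomials $P(z) = (1 - 2z)(1 - z/4)$, with one zero at $1/2$ inside the unit circle, and $Q(z) = 1 - z/3$, with its zero outside; the helper names `log_integral` and `zeros_inside_sum` are hypothetical, and the integral is approximated by a grid average.

```python
import numpy as np

def log_integral(p_asc, q_asc, grid=4096):
    """Grid average of log|P/Q| on the unit circle; coefficients in ascending powers of z."""
    theta = np.linspace(-np.pi, np.pi, grid, endpoint=False)
    z = np.exp(1j * theta)
    P = np.polyval(p_asc[::-1], z)     # np.polyval expects the highest power first
    Q = np.polyval(q_asc[::-1], z)
    return np.mean(np.log(np.abs(P / Q)))

def zeros_inside_sum(p_asc):
    """Sum of log(1/|beta|) over the zeros beta of P strictly inside the unit circle."""
    roots = np.roots(p_asc[::-1])
    inside = roots[np.abs(roots) < 1.0]
    return np.sum(np.log(1.0 / np.abs(inside)))

p = np.array([1.0, -2.25, 0.5])        # P(z) = (1 - 2z)(1 - z/4), zeros at 1/2 and 4
q = np.array([1.0, -1.0 / 3.0])        # Q(z) = 1 - z/3, zero at 3
lhs = log_integral(p, q)
rhs = zeros_inside_sum(p)
print(lhs, rhs)
```

Only the zero of $P$ inside the unit circle contributes, so both sides equal $\log 2$ in this example.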

Our last factorization theorem is concerned with the factorization of positive $L_1$ functions and is
usually called the canonical factorization theorem. Suppose $f(e^{i\theta}) \in L_1$. Then, $f(e^{i\theta}) = |g(e^{i\theta})|^2$ for
some $g(e^{i\theta}) \in H_2$ if and only if $f(e^{i\theta}) \ge 0$ almost everywhere and the Paley-Wiener condition (7) is
satisfied. In the light of the aforementioned factorization theorem due to F. Riesz, we can always take
the canonical factor $g$ with no zeros inside the unit circle and $g(0) > 0$.



C. Discrete Algebraic Riccati Equations 

Discrete algebraic Riccati equations (DAREs) often play a crucial role in many estimation and control
problems. Our problem is no exception, especially the characterization of ARMA($k$) feedback capacity
in Section VII.

Here we focus on a very special class of Riccati equations and review a few properties of them. 
Since the necessary results are somewhat scattered in the literature, we also provide short proofs along 
with probabilistic interpretations; some of these might be new. Whenever possible, however, we will 
refer to standard references. For a more general treatment, refer to Kailath, Sayed, and Hassibi [42] and 
Lancaster and Rodman [51]. 

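Given matrices $F \in \mathbb{R}^{k \times k}$ and $H \in \mathbb{R}^{1 \times k}$, we study the following discrete algebraic Riccati equation:

$$\Sigma = F \Sigma F' - \frac{(F \Sigma H')(F \Sigma H')'}{1 + H \Sigma H'}. \qquad (10)$$

For each $k \times k$ Hermitian matrix $\Sigma$, define

$$\Gamma = \Gamma(\Sigma) = \frac{F \Sigma H'}{1 + H \Sigma H'}.$$

We are concerned with solutions of (10), especially the ones with stable $F - \Gamma H$.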

Lemma II.4 (DARE). Suppose $F$ has no unit-circle eigenvalue and $\{F, H\}$ is detectable, that is, there
exists $G \in \mathbb{R}^{k \times 1}$ such that $F - GH$ is stable (i.e., every eigenvalue of $F - GH$ lies inside the unit
circle). Then, the following statements hold.

(i) $\Sigma = 0$ is a solution to (10).

(ii) There is a unique solution $\Sigma = \Sigma_+$ to (10) such that $F - \Gamma H$ is stable. Furthermore, $\Sigma_+ \succeq \Sigma$ for
any other $\Sigma$ satisfying (10). In particular, $\Sigma_+$ is positive semidefinite.

(iii) If $F$ is invertible, then $F - \Gamma H$ is invertible for each solution $\Sigma$ and

$$1 + H \Sigma H' = \frac{\det(F)}{\det(F - \Gamma H)}.$$

(iv) Let $\Gamma_+ = \Gamma(\Sigma_+)$. If $F$ has eigenvalues $\lambda_1, \ldots, \lambda_k$ with
$|\lambda_1| \ge \cdots \ge |\lambda_j| > 1 > |\lambda_{j+1}| \ge \cdots \ge |\lambda_k|$,
then $F - \Gamma_+ H$ has eigenvalues $1/\lambda_1, \ldots, 1/\lambda_j, \lambda_{j+1}, \ldots, \lambda_k$.

(v) If every eigenvalue of $F$ lies inside the unit circle, then the stabilizing solution $\Sigma_+$ is identically
zero. Thus, $\Sigma = 0$ is the unique positive semidefinite solution to (10).

(vi) If every eigenvalue of $F$ lies outside the unit circle, then $\Sigma_+ \succ 0$.

(vii) More generally, suppose $F$ has $j$ eigenvalues outside the unit circle and $k - j$ eigenvalues inside
the unit circle. Then, $\mathrm{rank}(\Sigma_+) = j$.

Proof. (i) Trivial.

(ii) Refer to [42, Theorem E.5.1].

(iii) Note that $\det(1 + H \Sigma H') = \det(I + \Sigma H'H)$. Now simple algebra reveals that
$(F - \Gamma H)(I + \Sigma H'H) = F$.

(iv) For simplicity, we assume that $F$ is invertible. We can easily check that

$$\begin{bmatrix} (F - \Gamma(\Sigma)H)^{-1} & 0 \\ -H'HF^{-1} & (F - \Gamma(\Sigma)H)' \end{bmatrix}
= \begin{bmatrix} I & -\Sigma \\ 0 & I \end{bmatrix}
\begin{bmatrix} F^{-1} & 0 \\ -H'HF^{-1} & F' \end{bmatrix}
\begin{bmatrix} I & \Sigma \\ 0 & I \end{bmatrix}$$

for any solution $\Sigma$, which implies that the eigenvalues of $\{(F - \Gamma H)', (F - \Gamma H)^{-1}\}$ coincide with
those of $\{F', F^{-1}\}$. Now the desired result follows from the fact that $F - \Gamma(\Sigma_+)H$ is stable.

(v) Refer to [42, Theorem E.6.1].

(vi) Refer to [42, Theorem E.6.2].

(vii) For simplicity, suppose $F$ can be diagonalized; the general case can be proved by using the
generalized eigenvectors associated with the Jordan canonical form of $F$. Take each eigenvalue-eigenvector
pair $(\lambda, x)$ of $F$ with $|\lambda| > 1$. Suppose $x \Sigma_+ = 0$. Then, we can easily check that
$x(F - \Gamma_+ H) = xF = \lambda x$, which violates the stability of $F - \Gamma_+ H$. Thus, $x \Sigma_+ \ne 0$, which implies
$\mathrm{rank}(\Sigma_+) \ge j$.

On the other hand, take each eigenvalue-eigenvector pair $(\lambda, x)$ of $F$ with $|\lambda| < 1$. From (10), we
have

$$x \Sigma_+ x' = |\lambda|^2 \, x \Sigma_+ x' - \frac{x F \Sigma_+ H' H \Sigma_+ F' x'}{1 + H \Sigma_+ H'},$$

or equivalently,

$$(1 - |\lambda|^2) \, x \Sigma_+ x' + \frac{x F \Sigma_+ H' H \Sigma_+ F' x'}{1 + H \Sigma_+ H'} = 0.$$

Since both terms of the above sum are nonnegative, we must have $x \Sigma_+ x' = 0$ and hence, because
$\Sigma_+$ is positive semidefinite, $x \Sigma_+ = 0$, which implies $\mathrm{rank}(\Sigma_+) \le j$. □



Algebraic Riccati equations naturally arise from asymptotic behaviors of recursive filters (e.g., Kalman 
filters). In the following lemma, we collect a few results on the convergence of the Riccati recursion. 

Lemma II.5 (Discrete Riccati recursion). Under the same assumption on $\{F, H\}$ as in Lemma II.4,
suppose $\Sigma_n$ is defined as

$$\Sigma_{n+1} = F \Sigma_n F' - \frac{(F \Sigma_n H')(F \Sigma_n H')'}{1 + H \Sigma_n H'} \qquad (11)$$

for some $\Sigma_0$. Then, the following statements hold:

(i) If $\Sigma_0 = 0$, then $\Sigma_n = 0$ for all $n$.

(ii) If $\Sigma_0 \succeq 0$, then $\Sigma_n \succeq 0$ for all $n$.

(iii) If $\Sigma_0 \succeq \tilde{\Sigma}_0 \succeq 0$, then $\Sigma_n \succeq \tilde{\Sigma}_n \succeq 0$ for all $n$.

(iv) If $\Sigma_0 \succ 0$, then $\Sigma_n \to \Sigma_+$, where $\Sigma_+ \succeq 0$ is the unique stabilizing solution to the DARE (10).

Proof. (i) Trivial.

(ii) Write (11) as

$$\Sigma_{n+1} = (F - \Gamma(\Sigma_n)H) \Sigma_n (F - \Gamma(\Sigma_n)H)' + \Gamma(\Sigma_n)\Gamma(\Sigma_n)'.$$

(iii) Refer to Caines [9, Theorem 3.5.1].

(iv) Let $\Pi \succeq 0$ be the unique solution of the Lyapunov equation

$$\Pi = (F - \Gamma_+ H)' \Pi (F - \Gamma_+ H) + \frac{H'H}{1 + H \Sigma_+ H'}. \qquad (12)$$

(Lemma II.4 guarantees the stability of $F - \Gamma_+ H$ and hence there exists a unique positive semidefinite
$\Pi$ satisfying (12).) Take any $\epsilon > 0$ such that $\Sigma_0 \succeq \epsilon I$ and $I + (\epsilon I - \Sigma_+)\Pi$ is nonsingular.
Now from Lemma 14.5.7 in [42], we have

$$I + (\Pi^{1/2})'(\Sigma_0 - \Sigma_+)\Pi^{1/2} \succeq I + (\Pi^{1/2})'(\epsilon I - \Sigma_+)\Pi^{1/2} \succ 0,$$

which implies the exponential convergence of $\Sigma_n$ to $\Sigma_+$ by Theorem 14.5.2 in [42]. □
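The statements of Lemmas II.4 and II.5 can be observed numerically by iterating the recursion (11). The sketch below is a minimal illustration, assuming the example pair $F = \mathrm{diag}(2, 1/3)$ and $H = [1 \;\; 1]$ (chosen here so that one eigenvalue of $F$ lies outside the unit circle and the pair is detectable); the function names `gain` and `riccati_step` are hypothetical.

```python
import numpy as np

def gain(F, H, Sigma):
    """Gamma(Sigma) = F Sigma H' / (1 + H Sigma H') for a row vector H of shape (1, k)."""
    return F @ Sigma @ H.T / (1.0 + (H @ Sigma @ H.T).item())

def riccati_step(F, H, Sigma):
    """One step of the recursion (11)."""
    v = F @ Sigma @ H.T
    return F @ Sigma @ F.T - (v @ v.T) / (1.0 + (H @ Sigma @ H.T).item())

F = np.diag([2.0, 1.0 / 3.0])      # one eigenvalue outside, one inside the unit circle
H = np.array([[1.0, 1.0]])

Sigma = np.eye(2)                  # Sigma_0 positive definite
for _ in range(300):
    Sigma = riccati_step(F, H, Sigma)

Gamma = gain(F, H, Sigma)
closed_loop = F - Gamma @ H
residual = np.linalg.norm(riccati_step(F, H, Sigma) - Sigma)
closed_loop_eigs = np.sort(np.abs(np.linalg.eigvals(closed_loop)))
innovation_var = 1.0 + (H @ Sigma @ H.T).item()
det_ratio = np.linalg.det(F) / np.linalg.det(closed_loop)
rank = np.linalg.matrix_rank(Sigma, tol=1e-8)
print(Sigma, closed_loop_eigs, innovation_var, det_ratio, rank, residual)
```

In this example the iterates approach $\Sigma_+ = \mathrm{diag}(3, 0)$, which has rank one; the eigenvalues of $F - \Gamma_+ H$ are $1/2$ and $1/3$, and $1 + H\Sigma_+ H' = \det(F)/\det(F - \Gamma_+ H) = 4$, in line with parts (iii), (iv), and (vii) of Lemma II.4.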

Although our approach so far has been mostly algebraic, we can give probabilistic interpretations to
the above results in the context of linear stochastic systems. Since $\{F, H\}$ is detectable, we will take
some $G$ such that $F - GH$ is stable. Consider the following state-space representation (see, for example,
Kailath [41]) of a stationary Gaussian process $\{Y_n\}_{n=-\infty}^{\infty}$:

$$S_{n+1} = (F - GH) S_n - G U_n$$
$$Y_n = H S_n + U_n \qquad (13)$$

where $\{U_n\}_{n=-\infty}^{\infty}$ are independent and identically distributed zero-mean unit-variance Gaussian random
variables, and the state $S_n$ is independent of $U_n$ for each $n$. It is easy to see that $\{Y_n\}_{n=-\infty}^{\infty}$ corresponds
to the filter output of the input process $\{U_n\}_{n=-\infty}^{\infty}$ through a linear time-invariant filter with transfer
function

$$f(z) = \frac{\det(I - zF)}{\det(I - z(F - GH))}. \qquad (14)$$

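Consider the state-space representation for the innovations $\tilde{Y}_n = Y_n - E(Y_n \mid Y_{-\infty}^{n-1})$. Write
$\tilde{S}_n = S_n - E(S_n \mid Y_{-\infty}^{n-1})$ and $\Sigma_+ = \mathrm{cov}(S_n \mid Y_{-\infty}^{n-1}) = \mathrm{cov}(\tilde{S}_n)$. Define $\Gamma_+ = \Gamma(\Sigma_+)$ as before.
Then, we can check through a little algebra that

$$\tilde{S}_{n+1} = (F - \Gamma_+ H) \tilde{S}_n - \Gamma_+ U_n$$
$$\tilde{Y}_n = H \tilde{S}_n + U_n$$

which implies that

$$\Sigma_+ = (F - \Gamma_+ H) \Sigma_+ (F - \Gamma_+ H)' + \Gamma_+ \Gamma_+'
= F \Sigma_+ F' - \frac{(F \Sigma_+ H')(F \Sigma_+ H')'}{1 + H \Sigma_+ H'}.$$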
Clearly, there must be a unique solution $\Sigma_+$ to the above equation that makes the above state-space
representation well-defined; this implies Lemma II.4(ii).

Note that $\{\tilde{Y}_n\}_{n=-\infty}^{\infty}$ is the output of $\{U_n\}_{n=-\infty}^{\infty}$ via the filter

$$g(z) = \frac{\det(I - zF)}{\det(I - z(F - \Gamma_+ H))}.$$

On the other hand, the innovations process $\{\tilde{Y}_n\}_{n=-\infty}^{\infty}$ is white. Therefore, $g(z)$ should be a normalized
Blaschke product (all-pass filter), which implies Lemma II.4(iv). Furthermore, since
$\mathrm{var}(\tilde{Y}_n) = 1 + H \Sigma_+ H'$, applying Jensen's formula, we have a stronger version of Lemma II.4(iii). The rank condition
on $\Sigma_+$ (Lemma II.4(vii)) can be viewed as how many "modes" of the state can be causally determined by
observing the output. Our development also gives a special case of the Szegő-Kolmogorov-Krein theorem.
For example, if $F$ is invertible,

$$h(Y) = \frac{1}{2} \log(2\pi e (1 + H \Sigma_+ H'))$$
$$= \frac{1}{2} \log(2\pi e) + \frac{1}{2} \log \frac{\det(F)}{\det(F - \Gamma_+ H)}$$
$$= \frac{1}{2} \log(2\pi e) + \frac{1}{2} \int_{-\pi}^{\pi} \log \left| \frac{\det(I - e^{i\theta} F)}{\det(I - e^{i\theta}(F - GH))} \right|^2 \frac{d\theta}{2\pi},$$

where the last equality can be justified by the inner-outer factorization theorem and Jensen's formula.

Now we consider a slightly nonstationary Gaussian process $\{Y'_n\}_{n=1}^{\infty}$, recursively defined with the
same state-space equation (13), but under the initial condition $S_0 = 0$ and $U_0 = 0$. Let $T_n$ denote the
linear transformation from $(U_1, \ldots, U_n)$ to $(Y'_1, \ldots, Y'_n)$ that corresponds to our state-space model. It is
easy to see that $T_n$ is Toeplitz (with respect to the natural basis on $(U_1, \ldots, U_n)$) and, in fact,

\[ T_n(j,k) = \int_{-\pi}^{\pi} f(e^{i\theta}) e^{-i(j-k)\theta} \frac{d\theta}{2\pi} \tag{15} \]

where $f(z)$ is the very transfer function in (14). Since $T_n$ is lower triangular with diagonal entries equal
to 1 and thus $\det(T_n) = 1$ for all $n$, the entropy rate of $\{Y'_n\}$ is given as

\[ h(\mathcal{Y}') = \lim_{n\to\infty} \frac{1}{n} h(Y'_1, \ldots, Y'_n) = \lim_{n\to\infty} \frac{1}{n} h(U_1, \ldots, U_n) = \frac{1}{2}\log(2\pi e) \]

which is strictly less than the entropy rate $h(\mathcal{Y}) = \frac{1}{2}\log(2\pi e(1 + H\Sigma_+H'))$ of the stationary process
$\{Y_n\}$ under the same state-space representation (13), provided that $F$ has an eigenvalue outside the unit circle.

The nonzero gap between the entropy rate $h(\mathcal{Y})$ of the stationary process $\{Y_n\}_{n=1}^{\infty}$ and the entropy
rate $h(\mathcal{Y}')$ of its nonstationary version $\{Y'_n\}_{n=1}^{\infty}$ can be understood from a beautiful result on Toeplitz
operators by Widom; see Böttcher and Silbermann [5, Proposition 1.12, Proposition 2.12, and Example
5.1]. We use the notation $T(f)$ to denote the Toeplitz operator associated with symbol $f$ as in (15) and
$T_n(f) \in \mathbb{R}^{n \times n}$ to denote the finite truncation of $T(f)$. Since the power spectral density of the stationary
process $\{Y_n\}$ is $|f(e^{i\theta})|^2$, our previous discussion on Toeplitz matrices and the trigonometric moment
problem shows that the covariance matrix of $(Y_1, \ldots, Y_n)$ is simply $T_n(|f|^2)$. On the other hand, from
our construction of the nonstationary process $\{Y'_n\}$, the covariance matrix of $(Y'_1, \ldots, Y'_n)$ is given as
$T_n(f)(T_n(f))'$. Now Widom's theorem shows

\[ T(|f|^2) = T(f)(T(f))' + (H(f))^2 \]

where $H = H(f)$ is the Hankel operator associated with symbol $f$ and is given by

\[ H(j,k) = \int_{-\pi}^{\pi} f(e^{i\theta}) e^{-i(j+k-1)\theta} \frac{d\theta}{2\pi}, \qquad j, k = 1, 2, \ldots. \]

(This result should not be confused with the Wiener-Hopf factorization $T(|f|^2) = (T(f))'T(f)$; see [5,
Section 1.5].) Thus, the Hankel adjustment term $H^2(f)$ contributes to the strict gap between the entropy
rates. We can represent $Y_n = Y'_n + V_n$ for some nonstationary process $\{V_n\}$ with infinite covariance
matrix $H^2(f)$ such that $\sum_n E V_n^2 < \infty$. Roughly speaking, the perturbation process $\{V_n\}$ with bounded
total power causes a strict boost in entropy rate. (Although our $f$ is rational, this phenomenon generalizes
to any $f$ in the Krein algebra, in which case $H^2(f)$ is a trace class operator [5, Section 5.1].)

Finally we remark that our previous discussion on the Riccati recursion implies a much stronger result
on the boost of entropy rate due to small perturbation. Consider $Y''_n = Y'_n + V_n$ where $(V_1, \ldots, V_k)$ has a
positive definite covariance matrix and $V_n = 0$ for all $n > k$. Lemma II.5(iv) shows that the entropy rate
of $\{Y''_n\}$ is $\frac{1}{2}\log(2\pi e(1 + H\Sigma_+H'))$, and hence any tiny perturbation to the nonstationary process results
in the entropy rate of the stationary version. Later, this phenomenon gives an alternative interpretation
of the role of message-bearing signals in feedback communication.

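The following example illustrates our point. Define $\{Y'_n\}_{n=1}^{\infty}$ as

\[
\begin{aligned}
Y'_1 &= U_1 \\
Y'_n &= U_n + \alpha U_{n-1}, \qquad n = 2, 3, \ldots,
\end{aligned}
\]

where $\alpha$ is a constant with $|\alpha| > 1$. Then, the entropy rate of the process $\{Y'_n\}_{n=1}^{\infty}$ is $\frac{1}{2}\log(2\pi e)$, although
$\{Y'_2, Y'_3, \ldots\}$ is stationary with entropy rate $\frac{1}{2}\log(2\pi e\alpha^2)$. Now define $\{Y''_n\}_{n=1}^{\infty}$ as

\[
\begin{aligned}
Y''_1 &= U_1 + \epsilon V \\
Y''_n &= U_n + \alpha U_{n-1}, \qquad n = 2, 3, \ldots,
\end{aligned}
\]

where $\epsilon > 0$ is an arbitrary constant and $V \sim N(0, 1)$ is independent of $\{U_n\}_{n=1}^{\infty}$. Then, the entropy
rate of the perturbed process is $\frac{1}{2}\log(2\pi e\alpha^2)$. Evidently, the entropy rate is discontinuous at $\epsilon = 0$ and
any tiny perturbation results in the same boost in the entropy rate.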
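
The discontinuity at $\epsilon = 0$ can be checked numerically on finite blocks. The following Python sketch is not part of the original development; it assumes that the covariance of $(Y''_1, \ldots, Y''_n)$ is $T_nT_n' + \epsilon^2 e_1e_1'$, where $T_n$ is the unit lower bidiagonal matrix with $\alpha$ on the subdiagonal, and evaluates the determinant in closed form through the matrix determinant lemma. The function name `block_entropy_rate` and the parameter values are hypothetical choices for illustration.

```python
import math

def block_entropy_rate(alpha, eps, n):
    """Return (1/n) h(Y''_1, ..., Y''_n) in nats, where
    Y''_1 = U_1 + eps * V and Y''_k = U_k + alpha * U_{k-1} for k >= 2."""
    # The covariance is T T' + eps^2 e1 e1', with T unit lower bidiagonal
    # (alpha on the subdiagonal), so det(T T') = 1. By the matrix determinant
    # lemma, det(covariance) = 1 + eps^2 * ||T^{-1} e1||^2, and forward
    # substitution gives T^{-1} e1 = (1, -alpha, alpha^2, ...).
    norm_sq = sum(alpha ** (2 * k) for k in range(n))
    log_det = math.log1p(eps ** 2 * norm_sq)
    return 0.5 * math.log(2 * math.pi * math.e) + log_det / (2 * n)

white_rate = 0.5 * math.log(2 * math.pi * math.e)
gap_unperturbed = block_entropy_rate(2.0, 0.0, 400) - white_rate
gap_perturbed = block_entropy_rate(2.0, 1e-6, 400) - white_rate
print(gap_unperturbed, gap_perturbed, math.log(2.0))
```

With $\alpha = 2$, the unperturbed gap is zero for every block length, while the perturbed gap slowly approaches $\frac{1}{2}\log\alpha^2 = \log 2$ as $n$ grows, even for a perturbation as small as $\epsilon = 10^{-6}$.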

D. Matrix Inequalities 

We recall the following facts on positive semidefinite Hermitian matrices. Proofs can be found in 
standard references on matrix analysis (see, for example, Gantmacher [25] and Horn and Johnson [32]) 
or can be derived easily from the related results therein. 

Lemma II.6. Suppose a Hermitian matrix $K$ is partitioned as

\[ K = \begin{bmatrix} A & B' \\ B & C \end{bmatrix} \]

where $A$ and $C$ are Hermitian. Further suppose $C$ is positive definite. Then $K$ is positive semidefinite
if and only if $A - B'C^{-1}B$ is positive semidefinite.

Lemma II.7. Suppose $K \in \mathbb{C}^{n \times n}$ is positive semidefinite Hermitian. Then, we have

\[ \log\det K \le \mathrm{tr}\, K - n, \]

with equality if and only if $K = I$.

Lemma II.8. Suppose $K$ and $\tilde{K}$ are positive semidefinite Hermitian of the same size. Then, we have

\[ \mathrm{tr}(K\tilde{K}) \ge 0. \]

Furthermore, the following statements are equivalent:

(i) $\mathrm{tr}(K\tilde{K}) = 0$.

(ii) $K\tilde{K} = 0$.

(iii) There exist a unitary matrix $Q$ and diagonal matrices $D$ and $\tilde{D}$ such that $K = QDQ'$, $\tilde{K} = Q\tilde{D}Q'$,
and $D\tilde{D} = 0$.



III. Gaussian Nonfeedback Capacity Revisited 

Before we set off to a long discussion on the feedback capacity, we revisit the (nonfeedback) capacity 
of a stationary Gaussian channel. In particular, we give a detailed derivation of the water-filling capacity 
formula 

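\[
C = \int_{-\pi}^{\pi} \frac{1}{2}\log\frac{\max\{S_Z(e^{i\theta}), \lambda\}}{S_Z(e^{i\theta})} \frac{d\theta}{2\pi},
\qquad
P = \int_{-\pi}^{\pi} \max\{0, \lambda - S_Z(e^{i\theta})\} \frac{d\theta}{2\pi}.
\]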

This apparent digression will be rewarded in three ways. First, we will present an elementary proof of 
the capacity theorem that does not rely on Szego's theorem on the asymptotics of large Toeplitz matrices, 
and hence is interesting on its own. Secondly, the parallel development of both feedback and nonfeedback 
capacities answers interesting questions such as when feedback increases the capacity. Thirdly and most 
importantly, the proof techniques developed for the nonfeedback problem will be utilized heavily for the 
case of feedback in the subsequent sections. 

We start with the $n$-block capacity for the Gaussian channel in the Cover-Pombra sense [13]. Define

\[ C_n = \max_{K_{X,n}} \frac{1}{2}\log\frac{\det(K_{X,n} + K_{Z,n})^{1/n}}{\det(K_{Z,n})^{1/n}} \tag{16} \]

where the maximization is over all $n \times n$ positive semidefinite symmetric matrices $K_{X,n}$ satisfying the
power constraint $\mathrm{tr}(K_{X,n}) \le nP$. The coding theorem by Cover and Pombra [13, Theorem 1] states
that the rate $C_n$ is achievable, that is, for every $\epsilon > 0$, there exists a sequence of $(2^{n(C_n - \epsilon)}, n)$ codes
with $P_e^{(n)} \to 0$. Conversely, for $\epsilon > 0$, any sequence of $(2^{n(C_n + \epsilon)}, n)$ codes has $P_e^{(n)}$ bounded away
from zero for all $n$.

The quantity $nC_n$ corresponds to the maximum mutual information

\[ I(X^n; Y^n) = h(Y^n) - h(Y^n \mid X^n) \]

between the channel input $X^n$ and the channel output $Y^n = X^n + Z^n$, maximized over all Gaussian
inputs $X^n \sim N_n(0, K_{X,n})$ with $\mathrm{tr}(K_{X,n}) \le nP$. Since the Gaussian input distribution maximizes the
output entropy $h(Y^n)$ under a given covariance constraint, $nC_n$ is the mutual information $I(X^n; Y^n)$
maximized over all input distributions on $X^n$ satisfying the power constraint $E\sum_{i=1}^{n} X_i^2 \le nP$.
Now from the stationarity of the noise process $\{Z_i\}_{i=1}^{\infty}$, the $n$-block capacity $C_n$ is superadditive in
the sense that

\[ mC_m + nC_n \le (m + n)C_{m+n} \]

for all $m$ and $n$. Indeed, if $X^{m+n} \sim N_{m+n}(0, K^*_{X,m} \oplus K^*_{X,n})$ where $K^*_{X,m} \oplus K^*_{X,n} = \mathrm{diag}(K^*_{X,m}, K^*_{X,n})$
denotes the direct sum of the matrices $K^*_{X,m}$ and $K^*_{X,n}$ that achieve the capacity for block sizes $m$ and
$n$, respectively, under the power constraint $P$, we have

\[
\begin{aligned}
mC_m + nC_n &= I(X_1^m; Y_1^m) + I(X_{m+1}^{m+n}; Y_{m+1}^{m+n}) \\
&= h(X_1^m) + h(X_{m+1}^{m+n}) - h(X_1^m \mid Y_1^m) - h(X_{m+1}^{m+n} \mid Y_{m+1}^{m+n}) \\
&\le h(X_1^m, X_{m+1}^{m+n}) - h(X_1^m, X_{m+1}^{m+n} \mid Y_1^m, Y_{m+1}^{m+n}) && (17) \\
&\le (m + n)C_{m+n} && (18)
\end{aligned}
\]

where (17) follows from nonnegativity of mutual information and the independence of $X_1^m$ and $X_{m+1}^{m+n}$,
and (18) follows since $\mathrm{tr}(K^*_{X,m} \oplus K^*_{X,n}) = \mathrm{tr}(K^*_{X,m}) + \mathrm{tr}(K^*_{X,n}) \le (m + n)P$ and thus $K^*_{X,m} \oplus K^*_{X,n}$ is
a feasible solution to the $(m + n)$-block capacity problem under the power constraint $P$. Consequently,
from a classical result in analysis (see, for example, Pólya and Szegő [72]), the superadditivity of $C_n$
implies that the limit of $C_n$ exists and $\lim_n C_n = \sup_n C_n$. Therefore, the capacity $C$ of the Gaussian
channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, is given by

\[
C = \lim_{n\to\infty} C_n
= \lim_{n\to\infty} \max_{\mathrm{tr}\, K_{X,n} \le nP} \frac{1}{2}\log\frac{\det(K_{X,n} + K_{Z,n})^{1/n}}{\det(K_{Z,n})^{1/n}}.
\]
In order to obtain the parametric water-filling characterization of the capacity $C$, there is one more step
that needs to be taken. In the classical approach, the optimization problem for $C_n$ is solved for each $n$
and then the limiting behavior of $C_n$ is analyzed via Szegő's limit theorem.

For each fixed $n$, the optimization problem for $C_n$ in (16) is well-studied; see, for example, Cover
and Thomas [14, Section 10.5]. The optimal $K^*_{X,n}$ belongs to the same eigenspace as $K_{Z,n}$, that is, if
$K_{Z,n}$ has an eigenvalue decomposition $K_{Z,n} = Q\Lambda Q'$ with a diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and
a unitary matrix $Q$, then $K^*_{X,n} = QLQ'$ for some diagonal matrix $L = \mathrm{diag}(l_1, \ldots, l_n)$. Furthermore,
the input eigenvalues $l_1, \ldots, l_n$ "water-fill" the noise eigenvalues $\lambda_1, \ldots, \lambda_n$ in the sense that

\[ l_i = (\lambda - \lambda_i)^+ = \max\{\lambda - \lambda_i, 0\}, \qquad i = 1, \ldots, n \tag{19} \]

where $\lambda$ is chosen such that

\[ \mathrm{tr}\, K^*_{X,n} = \sum_i l_i = \sum_i (\lambda - \lambda_i)^+ = nP. \]

Plugging $K^*_{X,n} = QLQ'$ into (16), we get

\[
C_n = \frac{1}{2n}\sum_{i=1}^{n} \log\frac{\lambda_i + (\lambda - \lambda_i)^+}{\lambda_i}
= \frac{1}{2n}\sum_{i=1}^{n} \log\frac{\max\{\lambda_i, \lambda\}}{\lambda_i}.
\]
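
The finite-dimensional water-filling rule (19) is simple to evaluate once the noise eigenvalues are known. The following Python sketch is an illustration added here, not part of the original derivation; the function names `water_fill` and `n_block_capacity` are hypothetical, the noise eigenvalues are assumed to be strictly positive, and the capacity is reported in nats per channel use.

```python
import math

def water_fill(noise_eigs, total_power):
    """Return (level, input_eigs) with input_eigs[i] = max(level - noise_eigs[i], 0)
    and sum(input_eigs) equal to total_power."""
    lam = sorted(noise_eigs)
    # Try to fill the k smallest noise eigenvalues, k = n, n-1, ..., 1,
    # and stop at the first k whose water level covers all k of them.
    for k in range(len(lam), 0, -1):
        level = (total_power + sum(lam[:k])) / k
        if level >= lam[k - 1]:
            break
    return level, [max(level - x, 0.0) for x in noise_eigs]

def n_block_capacity(noise_eigs, power):
    """C_n for the per-symbol power constraint `power`, i.e. tr(K_X) <= n * power."""
    n = len(noise_eigs)
    level, _ = water_fill(noise_eigs, n * power)
    return sum(math.log(max(x, level) / x) for x in noise_eigs) / (2 * n)

level, input_eigs = water_fill([0.5, 1.5], 2.0)
print(level, input_eigs, n_block_capacity([0.5, 1.5], 1.0))
```

For noise eigenvalues $(0.5, 1.5)$ and $P = 1$, both modes are active with water level $\lambda = 2$; for noise eigenvalues $(4, 1)$ and total power 1, only the smaller eigenvalue receives power.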

In fact, the optimization problem in (16) is a simple instance of a matrix determinant maximization
problem. (See Vandenberghe, Boyd, and Wu [94] for an excellent review of the matrix determinant
maximization (max-det) problem with linear matrix inequality constraints.) Indeed, ignoring the subscripts,
we can reformulate (16) as

\[
\begin{aligned}
\text{maximize} \quad & \log\det K_Y \\
\text{subject to} \quad & K_Y - K_Z \succeq 0 \\
& \mathrm{tr}(K_Y - K_Z) \le nP.
\end{aligned}
\tag{20}
\]

Now consider any $\nu > 0$ and any positive definite matrix $\Phi$ such that $\Psi := \nu I - \Phi \succeq 0$. From
Lemmas II.7 and II.8 in the previous section and constraints on $K_Y$ in (20), we have for any feasible
$K_Y$ that

\[
\begin{aligned}
\log\det(K_Y) &\le -\log\det(\Phi) + \mathrm{tr}(K_Y\Phi) - n \\
&= -\log\det(\Phi) + \nu\,\mathrm{tr}(K_Y) - \mathrm{tr}(K_Y\Psi) - n \\
&\le -\log\det(\Phi) + \nu(\mathrm{tr}(K_Z) + nP) - \mathrm{tr}(K_Z\Psi) - n \\
&= -\log\det(\Phi) + \mathrm{tr}(K_Z\Phi) + nP\nu - n.
\end{aligned}
\tag{21}
\]



Thus, we get the following optimization problem as an upper bound on (20), which is another max-det
problem:

\[
\begin{aligned}
\text{minimize} \quad & -\log\det\Phi + \mathrm{tr}(\Phi K_Z) + nP\nu - n \\
\text{subject to} \quad & \nu > 0 \\
& \Phi \succ 0 \\
& \nu I - \Phi \succeq 0.
\end{aligned}
\tag{22}
\]

Although we have arrived at the problem (22) from first principles, we can easily check that this problem
is indeed the Lagrange dual to (20); see Vandenberghe et al. [94, Section 3]. Moreover, both the primal
problem (20) and the dual problem (22) are strictly feasible. Hence, from the standard results in convex
optimization (Rockafellar [74, Sections 29-30] and Boyd and Vandenberghe [6, Chapter 5]), strong
duality holds and there exist $K^*_Y, \nu^*, \Phi^*$ satisfying (21) with equality. Indeed, following the equality
conditions for the chain of inequalities (21), we find the following properties of the optimal $K^*_Y$.

Proposition III.1. The $n$-block capacity $C_n$ defined in (16) is achieved by $K^*_X$ and the corresponding
$K^*_Y = K^*_X + K_Z$ if and only if both of the following conditions are satisfied:

(23) Power: $\mathrm{tr}(K^*_X) = nP$.

(24) Water-filling: $\mathrm{tr}(K^*_X(K^*_Y - \lambda_{\min}(K^*_Y)I)) = 0$.

Although the water-filling condition (24) looks, at first, quite different from the traditional
representation, Lemma II.8 shows that (24) is indeed equivalent to (19).
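Once we have the parametric characterization of the capacity as

\[
\begin{aligned}
C(\lambda) &= \lim_{n\to\infty} \frac{1}{2n}\sum_{i=1}^{n} \log\frac{\max\{\lambda, \lambda_i(K_{Z,n})\}}{\lambda_i(K_{Z,n})} \\
P(\lambda) &= \lim_{n\to\infty} \frac{1}{n}\sum_{i=1}^{n} (\lambda - \lambda_i(K_{Z,n}))^+
\end{aligned}
\]

we can apply Szegő's limit theorem and use the continuity of the capacity $C$ in the power constraint $P$
to obtain the desired capacity formula:

\[
\begin{aligned}
C(\lambda) &= \int_{-\pi}^{\pi} \frac{1}{2}\log\frac{\max\{S_Z(e^{i\theta}), \lambda\}}{S_Z(e^{i\theta})} \frac{d\theta}{2\pi} \\
P(\lambda) &= \int_{-\pi}^{\pi} (\lambda - S_Z(e^{i\theta}))^+ \frac{d\theta}{2\pi}.
\end{aligned}
\]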

This standard derivation of the Gaussian channel capacity based on the first Szego theorem traces 
back to Tsybakov [92], [93] in the literature. (See also Gray [26, Section V] and Blahut [4] for a 
detailed proof.) An alternative proof was given by Hirt and Massey [30] who approximated a finite 
impulse response intersymbol interference channel (or equivalently, a finite-order autoregressive noise 
channel) by an intersymbol interference Gaussian channel with circular convolution, and analyzed the 
asymptotic eigenvalue distribution of the resulting circulant matrix. In the light of the standard technique 
of approximating a Toeplitz matrix by circulant matrices (see, for example, tutorials by Gray [26], [27]), 
the development by Hirt and Massey is essentially along the line of the traditional approach based on 
the asymptotics of large Toeplitz matrices. 

Now we give yet another proof of the Gaussian capacity theorem that does not rely on the asymptotics
of large Toeplitz matrices. (To be fair, no proof can be totally independent of Szegő's limit theorem, since
the entropy rate of a stationary Gaussian process is given by the Szegő-Kolmogorov-Krein formula.)
The main idea is very simple. First we spin off from (16) and show that the capacity is achieved by a
stationary Gaussian input process, which gives a variational formulation of the capacity as

\[ C = \sup_{S_X(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2}\log\frac{S_X(e^{i\theta}) + S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \frac{d\theta}{2\pi} \tag{25} \]

where the supremum is taken over all $S_X(e^{i\theta}) \ge 0$ satisfying the power constraint

\[ \int_{-\pi}^{\pi} S_X(e^{i\theta}) \frac{d\theta}{2\pi} \le P. \]

This characterization states that the capacity of the Gaussian channel is equal to the maximum information
rate between a stationary (Gaussian) input process $\{X_i\}_{i=-\infty}^{\infty}$ and the corresponding output process
$\{Y_i\}_{i=-\infty}^{\infty}$, or equivalently, the maximum entropy rate $h(\mathcal{Y})$ of the output process minus the noise
entropy rate $h(\mathcal{Z})$. Hence, the variational characterization (25) can be viewed as the justification for the
interchange of the order of maximum and limit in

\[ \lim_{n\to\infty} \max_{X^n} \frac{1}{n} I(X^n; Y^n) = \sup_{\{X_i\}} \lim_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \]

where the maximum on the left-hand side is over all distributions on random $n$-vectors $X^n$ satisfying
$E(\sum_{i=1}^{n} X_i^2) \le nP$, while the supremum on the right-hand side is over all stationary processes $\{X_i\}$
with $EX_i^2 \le P$.

Note that the finite-dimensional water-filling solution (19) does not directly imply the variational
formulation (25), for, in general, the optimal $K^*_{X,n}$ is not Toeplitz nor is the sequence $\{K^*_{X,n}\}_{n=1}^{\infty}$
consistent. Once we establish (25), we will show by elementary arguments that the quantity (25) is
indeed equal to the water-filling capacity formula. Details of the proof follow.

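Define

\[ \tilde{C} := \sup_{S_X(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2}\log\frac{S_X(e^{i\theta}) + S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \frac{d\theta}{2\pi} \tag{26} \]

where the supremum is over all $S_X(e^{i\theta}) \ge 0$ such that $\int_{-\pi}^{\pi} S_X(e^{i\theta}) \frac{d\theta}{2\pi} \le P$. Since Gaussian processes
maximize the entropy rate under the second moment constraint, we have

\[
\begin{aligned}
\tilde{C} &= \sup_{\{X_i\}} \lim_{n\to\infty} \frac{1}{n} I(X^n; Y^n) \\
&= \sup_{\{X_i\}} h(\mathcal{Y}) - h(\mathcal{Z}),
\end{aligned}
\tag{27}
\]

where the supremums are over all stationary processes $\{X_i\}$, independent of $\{Z_i\}$, and satisfying the
power constraint $EX_i^2 \le P$.
We first prove

\[ C_n \le \tilde{C} \le C_n + \frac{h(Z^n)}{n} - h(\mathcal{Z}) \tag{28} \]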

for all $n$, which implies that $\lim_{n\to\infty} C_n = \tilde{C}$. Fix $n$ and let $K^*_{X,n}$ achieve $C_n$. We consider a two-sided
input process $\{X_i\}_{i=-\infty}^{\infty}$ that is blockwise stationary (= cyclostationary) with the blocks
$(X_{kn+1}, \ldots, X_{(k+1)n})$, $-\infty < k < \infty$, i.i.d. $\sim N_n(0, K^*_{X,n})$, and is independent of the stationary Gaussian
noise process $\{Z_i\}_{i=-\infty}^{\infty}$. Let $Y_i = X_i + Z_i$, $-\infty < i < \infty$, be the corresponding output process through
the stationary Gaussian channel.
For each $t = 0, 1, \ldots, n - 1$, define a time-shifted process $\{X_i(t)\}_{i=-\infty}^{\infty}$ as $X_i(t) = X_{i+t}$ for all $i$ and
similarly define $\{Y_i(t)\}_{i=-\infty}^{\infty}$ and $\{Z_i(t)\}_{i=-\infty}^{\infty}$. Obviously, $Y_i(t) = X_i(t) + Z_i(t)$ for all $i$. Using the
inequality (17) that was used to prove the superadditivity of $C_n$, we have

\[ C_n \le \frac{1}{kn} I(X_1^{kn}; Y_1^{kn}), \qquad k = 1, 2, \ldots \]

and hence for all $m = 1, 2, \ldots$, and each $t = 0, \ldots, n - 1$, we have

\[
\begin{aligned}
C_n &\le \frac{1}{m} I(X_1^m(t); Y_1^m(t)) + \epsilon_m \\
&= \frac{1}{m}\bigl(h(Y_1^m(t)) - h(Z_1^m(t))\bigr) + \epsilon_m \\
&= \frac{1}{m}\bigl(h(Y_1^m(t)) - h(Z_1^m)\bigr) + \epsilon_m
\end{aligned}
\tag{29}
\]

for some $\epsilon_m$ that vanishes uniformly in $t$ as $m \to \infty$. Here the last equality follows from the stationarity
of $Z$.

Now let $T$ be a random variable uniformly distributed on $\{0, 1, \ldots, n - 1\}$ and independent of
$\{X_i\}_{i=-\infty}^{\infty}$ and $\{Z_i\}_{i=-\infty}^{\infty}$. We make the following observations:

(30) $\{(X_i(T), Y_i(T), Z_i(T))\}_{i=-\infty}^{\infty}$ is stationary with $Y_i(T) = X_i(T) + Z_i(T)$ for all $i$.

(31) $E[X_i^2(T)] = E[E(X_i^2(T) \mid T)] \le P$.

(32) The autocorrelation function of $\{X_i(T)\}$ is banded, and hence the power spectral distribution
of $\{X_i(T)\}$ is absolutely continuous with respect to the Lebesgue measure.

(33) The processes $\{X_i(T)\}$ and $\{Z_i(T)\}$ are orthogonal in the sense that, for all $i, j$,
$E[X_i(T)Z_j(T)] = E[E(X_i(T)Z_j(T) \mid T)] = 0$.

(34) $\{Z_i(T)\}$ has the same distribution as $\{Z_i\}$.

Finally let $\{\bar{X}_i, \bar{Y}_i, \bar{Z}_i\}_{i=-\infty}^{\infty}$ be a jointly Gaussian process with the same mean and autocorrelation as
the stationary process $\{X_i(T), Y_i(T), Z_i(T)\}_{i=-\infty}^{\infty}$. Note that $\{\bar{X}_i, \bar{Y}_i, \bar{Z}_i\}$ also satisfies the properties
(30)-(34). In addition, the input process $\{\bar{X}_i\}$ is independent of the noise process $\{\bar{Z}_i\}$. Hence, $\{\bar{X}_i\}$
is feasible for the maximization in (27). Now that $\{\bar{Y}_i\}$ has a larger entropy rate than $\{Y_i(T)\}$ and the
inequality (29) holds uniformly for any $t$, we continue from the inequality (29) to get

\[
\begin{aligned}
C_n &\le \frac{1}{m}\bigl(h(Y_1^m(T) \mid T) - h(Z_1^m)\bigr) + \epsilon_m \\
&\le \frac{1}{m}\bigl(h(Y_1^m(T)) - h(Z_1^m)\bigr) + \epsilon_m \\
&\le \frac{1}{m}\bigl(h(\bar{Y}_1^m) - h(\bar{Z}_1^m)\bigr) + \epsilon_m.
\end{aligned}
\]

By letting $m$ tend to infinity, we have

\[ C_n \le h(\bar{\mathcal{Y}}) - h(\bar{\mathcal{Z}}) \le \tilde{C} \]

where the last inequality follows from the definition of $\tilde{C}$ in (27).

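For the other direction of inequality, fix $\epsilon > 0$ and let the stationary Gaussian input process $\{X_i\}_{i=-\infty}^{\infty}$
achieve $\tilde{C} - \epsilon$. Let $\{Y_i\}_{i=-\infty}^{\infty}$ be the corresponding output process. Since $X^n$ trivially satisfies the power
constraint $E\sum_{i=1}^{n} X_i^2 \le nP$, we have

\[ C_n \ge \frac{1}{n} I(X^n; Y^n) = \frac{h(Y^n)}{n} - \frac{h(Z^n)}{n}. \]

But $n^{-1}h(Y^n)$ is decreasing in $n$, with limit $h(\mathcal{Y})$. Hence,

\[ C_n \ge h(\mathcal{Y}) - \frac{h(Z^n)}{n} = \tilde{C} - \epsilon + h(\mathcal{Z}) - \frac{h(Z^n)}{n}. \]

The desired inequality follows immediately since $\epsilon > 0$ is arbitrary. Thus, we have shown that $C = \tilde{C}$.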
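Now we show that the supremum of (26) is attained by

\[ S_X^*(e^{i\theta}) = (\lambda - S_Z(e^{i\theta}))^+ = \max\{\lambda - S_Z(e^{i\theta}), 0\} \]

where $\lambda$ is chosen to satisfy the power constraint with equality. For a parallel development with the
feedback case in the subsequent sections, we change the optimization variable to $S_Y(e^{i\theta})$ and show that
the infinite-dimensional optimization problem

\[
\begin{aligned}
\text{maximize} \quad & \int_{-\pi}^{\pi} \log S_Y(e^{i\theta}) \frac{d\theta}{2\pi} \\
\text{subject to} \quad & S_Y(e^{i\theta}) - S_Z(e^{i\theta}) \ge 0 \quad \text{for all } \theta \\
& \int_{-\pi}^{\pi} (S_Y(e^{i\theta}) - S_Z(e^{i\theta})) \frac{d\theta}{2\pi} \le P
\end{aligned}
\tag{35}
\]

has the optimal solution

\[ S_Y^*(e^{i\theta}) = (\lambda - S_Z(e^{i\theta}))^+ + S_Z(e^{i\theta}) = \max\{S_Z(e^{i\theta}), \lambda\} \tag{36} \]

with $\lambda > 0$ chosen to satisfy

\[ \int_{-\pi}^{\pi} (\lambda - S_Z(e^{i\theta}))^+ \frac{d\theta}{2\pi} = P. \tag{37} \]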
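
The pair (36)-(37) can be evaluated numerically for a given noise spectrum. The following Python sketch is an illustration under stated assumptions, not the method of the text: it discretizes $\theta$ with a midpoint rule, finds the water level $\lambda$ in (37) by bisection, and returns the resulting capacity in nats. The function name `spectral_water_filling`, the grid size, and the iteration count are hypothetical choices, and the noise spectrum is assumed to be strictly positive.

```python
import math

def spectral_water_filling(noise_psd, power, grid=4096):
    """Approximate (level, capacity) for a noise spectrum noise_psd(theta) > 0."""
    thetas = [-math.pi + 2.0 * math.pi * (j + 0.5) / grid for j in range(grid)]
    s_z = [noise_psd(theta) for theta in thetas]

    def used_power(level):
        # Midpoint-rule value of the integral of (level - S_Z)^+ dtheta / (2 pi).
        return sum(max(level - s, 0.0) for s in s_z) / grid

    # used_power(lo) = 0 <= power <= used_power(hi), so bisection applies.
    lo, hi = min(s_z), max(s_z) + power
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if used_power(mid) < power:
            lo = mid
        else:
            hi = mid
    level = 0.5 * (lo + hi)
    capacity = sum(0.5 * math.log(max(s, level) / s) for s in s_z) / grid
    return level, capacity

# First-order moving average noise with S_Z = |1 + 0.5 e^{i theta}|^2 and P = 2.
ma_level, ma_capacity = spectral_water_filling(lambda t: 1.25 + math.cos(t), 2.0)
print(ma_level, ma_capacity)
```

In this example the power is large enough that every frequency is active, so $\lambda = P + \int S_Z \frac{d\theta}{2\pi} = 3.25$, and since $\int \log|1 + 0.5e^{i\theta}|^2 \frac{d\theta}{2\pi} = 0$ the capacity is $\frac{1}{2}\log 3.25$ nats.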

Note that this optimization problem is the infinite-dimensional analogue of the matrix determinant
maximization problem (20) for the $n$-block capacity $C_n$. However, it is often very difficult to establish
the strong duality for the infinite-dimensional optimization problem, even when the problem is convex.
(See Ekeland and Temam [20].) Here we avoid using the general duality theory on topological vector
spaces and take a rather elementary approach to duality, which turns out to be powerful enough to
establish the optimality of $S_Y^*(e^{i\theta})$.

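Take any $\nu > 0$ and $\phi(e^{i\theta}) > 0$ such that $\psi(e^{i\theta}) := \nu - \phi(e^{i\theta}) \ge 0$ for all $\theta$. Consider any feasible
$S_Y(e^{i\theta})$ satisfying the constraints for the maximization problem (35). Since $\log x \le x - 1$ for all $x \ge 0$,
we have

\[
\begin{aligned}
\log S_Y(e^{i\theta}) &\le -\log\phi(e^{i\theta}) + \phi(e^{i\theta})S_Y(e^{i\theta}) - 1 \\
&= -\log\phi(e^{i\theta}) + \nu S_Y(e^{i\theta}) - \psi(e^{i\theta})S_Y(e^{i\theta}) - 1 \\
&\le -\log\phi(e^{i\theta}) + \nu S_Y(e^{i\theta}) - \psi(e^{i\theta})S_Z(e^{i\theta}) - 1.
\end{aligned}
\]

By integrating both sides of the above inequality with respect to $\theta$ and applying the constraints on
$S_Y(e^{i\theta})$ in (35), we obtain an upper bound of (35) as

\[
\int_{-\pi}^{\pi} \log S_Y(e^{i\theta}) \frac{d\theta}{2\pi}
\le -\int_{-\pi}^{\pi} \log\phi(e^{i\theta}) \frac{d\theta}{2\pi}
+ \int_{-\pi}^{\pi} \phi(e^{i\theta})S_Z(e^{i\theta}) \frac{d\theta}{2\pi} + \nu P - 1.
\tag{38}
\]

This upper bound is universal in the sense that the inequality (38) holds for any feasible $S_Y(e^{i\theta})$ and
any $\nu > 0$ and $0 < \phi(e^{i\theta}) \le \nu$.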

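Now consider a particular choice of $\nu = \nu^*$ and $\phi(e^{i\theta}) = \phi^*(e^{i\theta})$ with $\nu^* = 1/\lambda > 0$ and $\phi^*(e^{i\theta}) =
1/S_Y^*(e^{i\theta}) > 0$, where $S_Y^*(e^{i\theta})$ and $\lambda$ are given by (36) and (37). It is easy to check that

\[ \psi^*(e^{i\theta}) = \nu^* - \phi^*(e^{i\theta}) = \frac{1}{\lambda} - \frac{1}{S_Y^*(e^{i\theta})} \ge 0 \quad \text{for all } \theta. \]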



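Plugging $(\nu^*, \phi^*(e^{i\theta}))$ into the right-hand side of (38) yields

\[
\begin{aligned}
& \int_{-\pi}^{\pi} \log S_Y^*(e^{i\theta}) \frac{d\theta}{2\pi}
+ \int_{-\pi}^{\pi} \frac{S_Z(e^{i\theta}) - S_Y^*(e^{i\theta})}{S_Y^*(e^{i\theta})} \frac{d\theta}{2\pi}
+ \int_{-\pi}^{\pi} \frac{S_Y^*(e^{i\theta}) - S_Z(e^{i\theta})}{\lambda} \frac{d\theta}{2\pi} \\
&\quad = \int_{-\pi}^{\pi} \log S_Y^*(e^{i\theta}) \frac{d\theta}{2\pi}
+ \int_{-\pi}^{\pi} \frac{(\lambda - S_Z(e^{i\theta}))^+ (S_Z(e^{i\theta}) - \lambda)^+}{\lambda S_Y^*(e^{i\theta})} \frac{d\theta}{2\pi} \\
&\quad = \int_{-\pi}^{\pi} \log S_Y^*(e^{i\theta}) \frac{d\theta}{2\pi}.
\end{aligned}
\]

Thus, we have shown that

\[ \int_{-\pi}^{\pi} \log S_Y(e^{i\theta}) \frac{d\theta}{2\pi} \le \int_{-\pi}^{\pi} \log S_Y^*(e^{i\theta}) \frac{d\theta}{2\pi} \]

for any feasible $S_Y(e^{i\theta})$. This establishes the optimality of $S_Y^*(e^{i\theta})$, whence the parametric water-filling
expression for the Gaussian channel capacity $C$.

IV. Variational Characterization of Gaussian Feedback Capacity

Given a stationary Gaussian channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, with the noise spectral distribution
$d\mu_Z(\theta) = S_Z(e^{i\theta})d\theta$, we wish to prove that

\[ C_{\mathrm{FB}} = \sup_{S_V(e^{i\theta}),\, B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2}\log\frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \frac{d\theta}{2\pi} \]

with the supremum taken over all $S_V(e^{i\theta}) \ge 0$ and all strictly causal polynomials $B(e^{i\theta}) = \sum_{k=1}^{m} b_k e^{ik\theta}$
satisfying the power constraint

\[ \int_{-\pi}^{\pi} \bigl(S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta})\bigr) \frac{d\theta}{2\pi} \le P. \]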

We will closely follow the derivation of (25) and (28) for the nonfeedback case in the previous section.
Again we start from the Cover-Pombra formulation of the $n$-block feedback capacity given by

\[ C_{\mathrm{FB},n} = \max_{K_{V,n},\, B_n} \frac{1}{2}\log\frac{\det(K_{V,n} + (B_n + I)K_{Z,n}(B_n + I)')^{1/n}}{\det(K_{Z,n})^{1/n}} \]

where the maximum is over all positive semidefinite $K_{V,n}$ and strictly lower triangular $B_n$ such that
$\mathrm{tr}(K_{V,n} + B_nK_{Z,n}B_n') \le nP$. Again the coding theorem by Cover and Pombra states that for every $\epsilon > 0$,
there exists a sequence of $(2^{n(C_{\mathrm{FB},n} - \epsilon)}, n)$ feedback codes with $P_e^{(n)} \to 0$. Conversely, for $\epsilon > 0$, any
sequence of $(2^{n(C_{\mathrm{FB},n} + \epsilon)}, n)$ codes has $P_e^{(n)}$ bounded away from zero for all $n$. Tracing the development
of Cover and Pombra backwards, we express $C_{\mathrm{FB},n}$ as

\[
\begin{aligned}
C_{\mathrm{FB},n} &= \max_{V^n + B_nZ^n} \frac{1}{2}\log\frac{\det(K_{Y,n})^{1/n}}{\det(K_{Z,n})^{1/n}} \\
&= \max_{V^n + B_nZ^n} \frac{1}{n}\bigl(h(Y^n) - h(Z^n)\bigr) \\
&= \max_{V^n + B_nZ^n} \frac{1}{n} I(V^n; Y^n)
\end{aligned}
\]

where the maximization is over all $X^n$ of the form $X^n = V^n + B_nZ^n$, resulting in $Y^n = V^n + (I + B_n)Z^n$,
with strictly lower-triangular $B_n$ and multivariate Gaussian $V^n$, independent of $Z^n$, satisfying
the power constraint $E\sum_{i=1}^{n} X_i^2 \le nP$.

Before we jump into the proof of this variational characterization through a detailed analysis on the
asymptotics of the $n$-block feedback capacity $C_{\mathrm{FB},n}$, we first explore a few interesting properties of
$C_{\mathrm{FB},n}$ itself for a finite $n$, which will be useful when we discuss properties of the (infinite-dimensional)
feedback capacity $C_{\mathrm{FB}}$ in subsequent sections.

For a given $n$, finding $C_{\mathrm{FB},n}$ is equivalent to solving the following optimization problem:

\[
\begin{aligned}
\text{maximize} \quad & \log\det(K_V + (I + B)K_Z(I + B)') \\
\text{subject to} \quad & K_V \succeq 0 \\
& \mathrm{tr}(K_V + BK_ZB') \le nP \\
& B \text{ strictly lower triangular.}
\end{aligned}
\tag{39}
\]

Although this problem is not convex in itself (with optimization variables $K_V$ and $B$), it can be easily
reformulated into a convex problem. This relatively unknown result is due to Boyd and Ordentlich (circa
1994), and appears as an example in Vandenberghe et al. [94, Equation (2.16)].

We observe that, given $B$, $K_Y = K_V + (I + B)K_Z(I + B)'$ is one-to-one mapped to $K_V$. So we
change the variable to $(K_Y, B)$ and rewrite (39) as

\[
\begin{aligned}
\text{maximize} \quad & \log\det(K_Y) \\
\text{subject to} \quad & K_Y - (I + B)K_Z(I + B)' \succeq 0 \\
& \mathrm{tr}(K_Y - BK_Z - K_ZB' - K_Z) \le nP \\
& B \text{ strictly lower triangular.}
\end{aligned}
\]

Now the first constraint

\[ K_Y - (I + B)K_Z(I + B)' \succeq 0 \]

can be turned into an equivalent linear matrix inequality

\[ \begin{bmatrix} K_Y & I + B \\ (I + B)' & K_Z^{-1} \end{bmatrix} \succeq 0 \]

from Lemma II.6. (Recall from Section II that $K_Z$ is nonsingular because $\mu_Z$ is nontrivial.) Hence, we
obtain the Boyd-Ordentlich formulation of the $n$-block feedback capacity, which is another instance of
the matrix determinant maximization problem with linear matrix inequality constraints:

\[
\begin{aligned}
\text{maximize} \quad & \log\det(K_Y) \\
\text{subject to} \quad & \begin{bmatrix} K_Y & I + B \\ (I + B)' & K_Z^{-1} \end{bmatrix} \succeq 0 \\
& \mathrm{tr}(K_Y - BK_Z - K_ZB' - K_Z) \le nP \\
& B \text{ strictly lower triangular.}
\end{aligned}
\tag{40}
\]

As a simple application of the Boyd-Ordentlich reformulation of the $n$-block feedback capacity, we
can easily recover the following result due to Yanagi, Chen, and Yu [103].

Proposition IV.1 (Yanagi-Chen-Yu). For an arbitrary (not necessarily Toeplitz) noise covariance
matrix $K_Z$, the $n$-block feedback capacity $C_{\mathrm{FB},n}(P)$ is concave in the power constraint $P$.

Proof. In the light of the Boyd-Ordentlich formulation, we write $C_{\mathrm{FB},n}(P)$ as

\[ f(P) := C_{\mathrm{FB},n}(P) = \max \frac{1}{2n}\log\frac{\det(K_Y)}{\det(K_Z)} \]

where the maximum is taken over all $K_Y$ and $B$ satisfying the constraints in (40). Suppose $(K_Y^{(1)}, B^{(1)})$
and $(K_Y^{(2)}, B^{(2)})$ achieve the feedback capacity under the power constraints $P_1$ and $P_2$, respectively.
Consider

\[ (K_Y, B) = \lambda(K_Y^{(1)}, B^{(1)}) + (1 - \lambda)(K_Y^{(2)}, B^{(2)}) \]

for some $\lambda \in [0, 1]$. It is trivial to check that $(K_Y, B)$ satisfies the constraints in (40) under the power
constraint $P = \lambda P_1 + (1 - \lambda)P_2$. Also from the concavity of $\log\det(\cdot)$,

\[ \log\det(K_Y) \ge \lambda\log\det(K_Y^{(1)}) + (1 - \lambda)\log\det(K_Y^{(2)}). \]

Thus,

\[ f(\lambda P_1 + (1 - \lambda)P_2) \ge \lambda f(P_1) + (1 - \lambda)f(P_2). \]

□



The convexity of the problem, however, has more interesting implications. As an analogue to
Proposition III.1, we give a characterization of the optimal $(K_V^*, B^*)$ in the following statement.

Proposition IV.2. The $n$-block feedback capacity

\[ C_{\mathrm{FB},n} = \max_{K_{V,n},\, B_n} \frac{1}{2}\log\frac{\det(K_{V,n} + (I + B_n)K_{Z,n}(I + B_n)')^{1/n}}{\det(K_{Z,n})^{1/n}} \]

is achieved by $(K_V^*, B^*)$ if and only if all of the following conditions are satisfied:

(41) Power: $\mathrm{tr}(K_V^* + B^*K_Z(B^*)') = nP$.

(42) Water-filling: The covariance matrix $K_V^*$ water-fills the modified noise covariance matrix
$(I + B^*)K_Z(I + B^*)'$. Equivalently,

\[ \mathrm{tr}(K_V^*(K_Y^* - \lambda_{\min}(K_Y^*)I)) = 0 \]

where $K_Y^* = K_V^* + (I + B^*)K_Z(I + B^*)'$.

(43) Orthogonality: The current input $X_i$ is independent of the past output $(Y_1, \ldots, Y_{i-1})$, i.e.,
$EX_iY_j = 0$ for all $1 \le j < i \le n$. Equivalently, $K_V^* + B^*K_Z(I + B^*)'$ is upper triangular.

The necessity of these conditions is somewhat obvious (see Ihara [34] and Ordentlich [62]). Indeed, the
first two conditions are needed, since for any $B$, the channel from $V^n$ to $Y^n$ is a Gaussian nonfeedback
channel with the noise covariance $(I + B)K_Z(I + B)'$ and Proposition III.1 applies. The orthogonality
of the current input $X_i$ and the past output $(Y_1, \ldots, Y_{i-1})$ is also intuitively clear; otherwise, we can
reduce the input power for the same rate by not sending the projection of $X_i$ onto the linear span of
$(Y_1, \ldots, Y_{i-1})$. (The receiver has that part of the information, anyway.) More precisely, we express the
channel input as

\[
\begin{aligned}
X^n &= V^n + BZ^n \\
&= \tilde{V}^n + \tilde{B}Y^n
\end{aligned}
\]

with $\tilde{V}^n = (I + B)^{-1}V^n$ and $\tilde{B} = (I + B)^{-1}B$, and denote each row of $\tilde{B}$ as $\tilde{b}_1, \ldots, \tilde{b}_n$. Then,

\[
\begin{aligned}
I(V^n; Y^n) &= I(\tilde{V}^n; Y^n) \\
&= \sum_{i=1}^{n} I(\tilde{V}^n; Y_i \mid Y^{i-1}) \\
&= \sum_{i=1}^{n} I(\tilde{V}^n; \tilde{V}_i + \tilde{b}_iY^n + Z_i \mid Y^{i-1}) \\
&= \sum_{i=1}^{n} I(\tilde{V}^n; \tilde{V}_i + Z_i \mid Y^{i-1}).
\end{aligned}
\]

As a consequence, when the distribution on $\tilde{V}^n$ is held fixed, $I(V^n; Y^n)$ is independent of $\tilde{B}$. Under
this same rate, the input power is minimized if we take

\[ X_i = \tilde{V}_i + \tilde{b}_iY^n = \tilde{V}_i - E(\tilde{V}_i \mid Y^{i-1}) \]

for all $i$. Clearly, $X_i$ is independent of $Y^{i-1}$. This simple observation has been sometimes emphasized as
the optimality of the Kalman filter as the feedback information processor (see, for example, Yang, Kavcic,
and Tatikonda [105, Theorem 1]).

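For the sufficiency (and the necessity as well) of the conditions (41)-(43) in Proposition IV.2, consider
any $\nu > 0$ and $n \times n$ matrices $\Phi, \Psi_1, \Psi_2, \Psi_3$ such that $\Phi \succ 0$, $\Psi_1 = \nu I - \Phi$, $\Psi_2 + \nu K_Z$ is upper triangular,
and

\[ \begin{bmatrix} \Psi_1 & \Psi_2 \\ \Psi_2' & \Psi_3 \end{bmatrix} \succeq 0. \]

Now for any feasible $B$ and $K_Y$ for (40), we have from Lemma II.8 that

\[
\mathrm{tr}\left( \begin{bmatrix} K_Y & I + B \\ (I + B)' & K_Z^{-1} \end{bmatrix} \begin{bmatrix} \Psi_1 & \Psi_2 \\ \Psi_2' & \Psi_3 \end{bmatrix} \right)
= \mathrm{tr}\bigl(K_Y\Psi_1 + (I + B)\Psi_2' + (I + B)'\Psi_2 + K_Z^{-1}\Psi_3\bigr) \ge 0
\]

and hence from Lemma II.7 that

\[
\begin{aligned}
\log\det(K_Y) &\le -\log\det(\Phi) + \mathrm{tr}(K_Y\Phi) - n \\
&= -\log\det(\Phi) + \nu\,\mathrm{tr}(K_Y) - \mathrm{tr}(K_Y\Psi_1) - n \\
&\le -\log\det(\Phi) + \nu\bigl(\mathrm{tr}(BK_Z + K_ZB' + K_Z) + nP\bigr) \\
&\qquad + \mathrm{tr}\bigl((I + B)\Psi_2' + (I + B)'\Psi_2 + K_Z^{-1}\Psi_3\bigr) - n \\
&= -\log\det(\Phi) + 2\,\mathrm{tr}(\Psi_2) + \mathrm{tr}(K_Z^{-1}\Psi_3) + 2\,\mathrm{tr}(B(\Psi_2' + \nu K_Z)) + \nu(\mathrm{tr}(K_Z) + nP) - n \\
&= -\log\det(\Phi) + 2\,\mathrm{tr}(\Psi_2) + \mathrm{tr}(K_Z^{-1}\Psi_3) + \nu(\mathrm{tr}(K_Z) + nP) - n
\end{aligned}
\tag{44}
\]

where the last equality follows from the triangularity conditions on $B$ and $\Psi_2 + \nu K_Z$. Thus, we have
obtained the dual¹ problem to (40), which is, once again, a matrix determinant maximization problem
with linear matrix inequality constraints:

\[
\begin{aligned}
\text{minimize} \quad & -\log\det(\Phi) + 2\,\mathrm{tr}(\Psi_2) + \mathrm{tr}(K_Z^{-1}\Psi_3) + \nu(\mathrm{tr}(K_Z) + nP) - n \\
\text{subject to} \quad & \nu > 0 \\
& \begin{bmatrix} \nu I - \Phi & \Psi_2 \\ \Psi_2' & \Psi_3 \end{bmatrix} \succeq 0 \\
& \Psi_2 + \nu K_Z \text{ upper triangular.}
\end{aligned}
\tag{45}
\]

As in the nonfeedback case, the optimality of any $(K_V^*, B^*)$ satisfying the conditions (41)-(43) of
Proposition IV.2 follows from Slater's condition (i.e., both primal and dual problems are strictly feasible)
and strong duality; see Vandenberghe et al.'s review [94] on the max-det problem. By checking the
equality conditions for the chain of inequalities (44), we can easily check that the duality gap is zero
with

\[
\begin{aligned}
\nu^* &= 1/\lambda_{\min}(K_Y^*) \\
\Phi^* &= (K_Y^*)^{-1} \\
\Psi_2^* &= -(\nu^*I - (K_Y^*)^{-1})(I + B^*)K_Z \\
\Psi_3^* &= K_Z(I + B^*)'(\nu^*I - (K_Y^*)^{-1})(I + B^*)K_Z
\end{aligned}
\]

and hence that the conditions (41)-(43) are sufficient and necessary.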

As a historical note, we remark that Ordentlich [62] obtained the necessary conditions (41)-(43)
from a simple but elegant fixed point argument. This development, which predates the Boyd-Ordentlich
formulation (40), has certain benefits over the more refined convex optimization approach explained
above. We will use a variant of Ordentlich's method as well as an infinite-dimensional version of the
above convex optimization method when we characterize the optimal feedback filter in the next section.

¹The optimization problem (45) is indeed the Lagrange dual to (40), which can be readily verified; see
Vandenberghe et al. [94, Section 3].

The first nontrivial application of the necessity of (41)-(43) in Proposition IV.2 is the following
structural result, again, due to Ordentlich [62].

Corollary IV.1 (Ordentlich). Suppose $K_Z$ is a covariance matrix corresponding to a stationary moving
average process of order $k$, or equivalently, $K_Z$ is Toeplitz and banded with bandwidth $2k + 1$ (i.e.,
$K_Z(i,j) = 0$ if $|i - j| > k$). Then the optimal $K_V^*$ for the optimization problem (39) has rank at most
$k$.

Proof. Let $K = K_Y^* - \lambda^*I$ where $\lambda^* = \lambda_{\min}(K_Y^*)$. Then from the orthogonality condition (43),
$K - K_Z(I + B^*)' + \lambda^*I$ is upper triangular. In other words, the strictly lower triangular part of $K$ is equal
to that of $K_Z(I + B^*)'$. In particular, if $K_Z$ is Toeplitz and banded with bandwidth $2k + 1$, $K$ is also
banded with bandwidth $2k + 1$, $K(i,j) = K_Z(i,j)$ if $|i - j| = k$, and thus $K$ has rank at least $n - k$.
But from the water-filling condition (42), we have $\mathrm{rank}(K_V^*) + \mathrm{rank}(K) \le n$. Hence, $K_V^*$ has rank at
most $k$. □

An important observation we can draw from the above proof is that the optimal output covariance
matrix $K_Y^*$ is also banded with the bandwidth $2k + 1$, regardless of the block size $n$. Later in Section VI
we will extend this observation to the ARMA noise channels and characterize the feedback capacity
thereof.

Proposition IV.2 also answers the following question: when does feedback increase the $n$-block
capacity? This question was completely answered by Baker [2] and Ihara and Yanagi [38], [101],
who characterized the sufficient and necessary condition for the increment, in the context of blockwise
whiteness of the noise covariance matrix. More specifically, for an arbitrary (not necessarily Toeplitz)
noise covariance matrix $K_Z$ of size $n \times n$, we define $L_k = \{l \ne k : K_Z(k,l) \ne 0\}$. We say that $K_Z$
is white if $L_k = \emptyset$ for all $k$, and blockwise white if $K_Z$ is nonwhite and $L_k = \emptyset$ for some $k$. When
$K_Z$ is blockwise white, we denote by $\tilde{K}_Z$ the submatrix of $K_Z$ constructed by $\{k : L_k \ne \emptyset\}$. Now
the result by Baker-Ihara-Yanagi states that feedback does not increase the $n$-block capacity for the
Gaussian channel with noise covariance matrix $K_Z$ under the power constraint $P$ if and only if

1) $K_Z$ is white, or

2) $K_Z$ is nonwhite and $P \le m\lambda_m - (\lambda_1 + \cdots + \lambda_m)$ where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ are eigenvalues
of $K_Z$ and $\lambda_m$ is the smallest eigenvalue of $\tilde{K}_Z$.

Here we give an equivalent statement, accompanied with a simple proof.

Corollary IV.2 (Baker-Ihara-Yanagi). Suppose $K_X^* = K_X^*(K_Z, P)$ achieves the nonfeedback capacity
$C_n = C_n(K_Z, P)$ for a given (not necessarily Toeplitz) noise covariance matrix $K_Z$ under the power
constraint $P$. Then, we have

\[ C_n(K_Z, P) = C_{\mathrm{FB},n}(K_Z, P) \]

if and only if $K_X^*(K_Z, P)$ is diagonal. In particular, if the noise process is stationary and nonwhite,
then feedback increases the $n$-block capacity for all $n$ and all $P > 0$.

Proof. Suppose $C_n = C_{\mathrm{FB},n}$, that is, $(K_V^*, B^*) = (K_X^*, 0)$ achieves the feedback capacity. Then, from
the orthogonality condition (43) in Proposition IV.2, $K_X^* = K_V^* + B^*K_Z(I + B^*)'$ is upper triangular.
Since $K_X^*$ is symmetric, it must be diagonal. Conversely, we see that $(K_V, B) = (K_X^*, 0)$ satisfies the
conditions (41)-(43) in Proposition IV.2, whence $C_n = C_{\mathrm{FB},n}$. □

From a numerical point of view, the duality result developed above gives the "solution" to the $n$-block
feedback capacity problem, since there is a polynomial-time algorithm for the determinant maximization
problem (40), based on the interior-point method. (See Nesterov and Nemirovskii [59] and Vandenberghe
et al. [94].) In fact, Sina Zahedi [108] at Stanford University developed a numerical solver that can handle
arbitrary covariance matrices of size, say, $n = 200$, with moderate computing power.

As for the (infinite-block) feedback capacity, however, there is still much to be done. First, the above
duality theory is for finite block-size $n$, however large it may be; it is another story to talk about the limit.
Furthermore, unlike the nonfeedback case, the complicated optimality condition in Proposition IV.2 has
both temporal and spectral components, and consequently, it seems very difficult, if not impossible, to
derive an analytic solution for $(K_V^\star, B^\star)$ even for a small $n$.

Thus motivated, we move on to the main theme of this section — the variational characterization of 
the feedback capacity. 

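Theorem IV.1. Suppose that the stationary Gaussian noise process $\{Z_i\}_{i=1}^{\infty}$ has the absolutely continuous
power spectral distribution $d\mu_Z(\theta) = S_Z(e^{i\theta})\,d\theta$. Then, the feedback capacity $C_{FB}$ of the Gaussian
channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, under the power constraint $P$, is given by

$$C_{FB} = \sup_{S_V(e^{i\theta}),\,B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi}$$

where the supremum is taken over all $S_V(e^{i\theta}) \ge 0$ and all strictly causal polynomials $B(e^{i\theta}) =
\sum_{k=1}^{m} b_k e^{ik\theta}$ satisfying the power constraint $\int_{-\pi}^{\pi} (S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta})) \, \frac{d\theta}{2\pi} \le P$.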

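Proof. Define

$$\tilde{C}_{FB} = \sup \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi}$$

where the supremum is taken over all $S_V(e^{i\theta}) \ge 0$ and all $B(e^{i\theta}) = \sum_{k=1}^{m} b_k e^{ik\theta}$ such that the power
constraint $\int_{-\pi}^{\pi} (S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta})) \, \frac{d\theta}{2\pi} \le P$ is satisfied. In light of the Szegő-Kolmogorov-Krein
theorem, we can express $\tilde{C}_{FB}$ also as

$$\tilde{C}_{FB} = \sup \, h(\mathcal{Y}) - h(\mathcal{Z})$$

where the supremum is taken over all stationary Gaussian processes $\{X_i\}_{i=-\infty}^{\infty}$ of the form $X_i =
V_i + \sum_{k=1}^{m} b_k Z_{i-k}$, where $\{V_i\}_{i=-\infty}^{\infty}$ is stationary and independent of $\{Z_i\}_{i=-\infty}^{\infty}$, such that $E X_i^2 \le P$.
We first show that

$$C_{FB,n} \le \tilde{C}_{FB} \tag{46}$$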

for all $n$, which is not so difficult thanks to our exercise on the nonfeedback case in the previous
section. First fix $n$ and let $(K_V^\star, B^\star)$ achieve $C_{FB,n}$. Consider a process $\{V_i\}_{i=-\infty}^{\infty}$ that is independent
of $\{Z_i\}_{i=-\infty}^{\infty}$ and blockwise white with $V_{kn+1}^{(k+1)n}$, $-\infty < k < \infty$, i.i.d. $\sim N_n(0, K_V^\star)$. Define a
process $\{X_i\}_{i=-\infty}^{\infty}$ as $X_{kn+1}^{(k+1)n} = V_{kn+1}^{(k+1)n} + B^\star Z_{kn+1}^{(k+1)n}$ for all $k$. And similarly, let $Y_i = X_i + Z_i$,
$-\infty < i < \infty$, be the corresponding output process through the stationary Gaussian channel. Note that
$Y_{kn+1}^{(k+1)n} = V_{kn+1}^{(k+1)n} + (I + B^\star) Z_{kn+1}^{(k+1)n}$ for all $k$. For each $t = 0, 1, \ldots, n - 1$, define the time-shifted
process $\{V_i(t)\}_{i=-\infty}^{\infty}$ as $V_i(t) = V_{t+i}$ for all $i$, and similarly define $\{X_i(t)\}_{i=-\infty}^{\infty}$, $\{Y_i(t)\}_{i=-\infty}^{\infty}$, and
$\{Z_i(t)\}_{i=-\infty}^{\infty}$. Note that $Y_i(t) = X_i(t) + Z_i(t)$ for all $i$ and all $t = 0, 1, \ldots, n - 1$, but $X_1^n(t)$ is not
equal to $V_1^n(t) + B^\star Z_1^n(t)$ in general.

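Now we focus on $X_1^{2n} = V_1^{2n} + (B^\star \oplus B^\star) Z_1^{2n}$. Then,

$$\begin{aligned}
2n\,C_{FB,n} &= I(V_1^n; Y_1^n) + I(V_{n+1}^{2n}; Y_{n+1}^{2n}) \\
&= h(V_1^n) + h(V_{n+1}^{2n}) - h(V_1^n | Y_1^n) - h(V_{n+1}^{2n} | Y_{n+1}^{2n}) \\
&\le h(V_1^{2n}) - h(V_1^{2n} | Y_1^{2n}) \\
&= h(Y_1^{2n}) - h(Z_1^{2n}).
\end{aligned}$$

By repeating the same argument, we have

$$C_{FB,n} \le \frac{1}{kn} I(V_1^{kn}; Y_1^{kn})$$

for all $k$. Hence, for all $m = 1, 2, \ldots$, and each $t = 0, \ldots, n - 1$, we have

$$\begin{aligned}
C_{FB,n} &\le \frac{1}{m} \left( h(Y_1^m(t)) - h(Z_1^m(t)) \right) + \epsilon_m \\
&= \frac{1}{m} \left( h(Y_1^m(t)) - h(Z_1^m) \right) + \epsilon_m
\end{aligned}$$

where $\epsilon_m$ absorbs the edge effect and vanishes uniformly in $t$ as $m \to \infty$.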

As before, we introduce a random variable $T$ uniform on $\{0, 1, \ldots, n - 1\}$ and independent of
everything else. It is easy to check the following:

(47) $\{V_i(T), X_i(T), Y_i(T), Z_i(T)\}_{i=-\infty}^{\infty}$ is stationary with $Y_i(T) = X_i(T) + Z_i(T)$.

(48) $\{X_i(T)\}_{i=-\infty}^{\infty}$ satisfies the power constraint

$$E[X_i^2(T)] = E[E(X_i^2(T) | T)] = \frac{1}{n} \mathrm{tr}(K_V^\star + B^\star K_Z (B^\star)') \le P.$$

(49) $\{V_i(T)\}_{i=-\infty}^{\infty}$ and $\{Z_i(T)\}_{i=-\infty}^{\infty}$ are orthogonal; that is, for all $i, j$,

$$E[V_i(T) Z_j(T)] = E[E(V_i(T) Z_j(T) | T)] = 0.$$

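(50) Although there is no linear relationship between $\{X_i(T)\}$ and $\{Z_i(T)\}$, $\{X_i(T)\}$ still depends
on $\{Z_i(T)\}$ in a strictly causal manner. More precisely, for all $i \le j$,

$$\begin{aligned}
E[X_i(T) Z_j(T) | Z_{i-n+1}^{i-1}(T)]
&= E[E(X_i(T) Z_j(T) | Z_{i-n+1}^{i-1}(T), T) \,|\, Z_{i-n+1}^{i-1}(T)] \\
&= E[E(X_i(T) | Z_{i-n+1}^{i-1}(T), T) \cdot E(Z_j(T) | Z_{i-n+1}^{i-1}(T), T) \,|\, Z_{i-n+1}^{i-1}(T)] \\
&= E[E(X_i(T) | Z_{i-n+1}^{i-1}(T), T) \cdot E(Z_j(T) | Z_{i-n+1}^{i-1}(T)) \,|\, Z_{i-n+1}^{i-1}(T)] \\
&= E[X_i(T) | Z_{i-n+1}^{i-1}(T)] \cdot E[Z_j(T) | Z_{i-n+1}^{i-1}(T)],
\end{aligned}$$

and for all $i$,

$$\mathrm{var}(X_i(T) - V_i(T) | Z_{i-n+1}^{i-1}(T)) = E[\mathrm{var}(X_i(T) - V_i(T) | Z_{i-n+1}^{i-1}(T), T) \,|\, Z_{i-n+1}^{i-1}(T)] = 0.$$

Roughly speaking, $X_i(T) = V_i(T) + f(Z_{i-n+1}^{i-1}(T))$ almost surely for some $f$.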
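(51) Since $\{V_i(T)\}$ has an absolutely continuous power spectral distribution, so does $\{X_i(T)\}$.

(52) $\{Z_i(T)\}$ has the same distribution as $\{Z_i\}$.

Finally, define $\{\tilde{V}_i, \tilde{X}_i, \tilde{Y}_i, \tilde{Z}_i\}_{i=-\infty}^{\infty}$ to be a jointly Gaussian process with the same mean and
autocorrelation as the stationary process $\{V_i(T), X_i(T), Y_i(T), Z_i(T)\}_{i=-\infty}^{\infty}$. It is easy to check that
$\{\tilde{V}_i, \tilde{X}_i, \tilde{Y}_i, \tilde{Z}_i\}$ also satisfies the properties (47)-(52) and hence that $\{\tilde{V}_i\}$ and $\{\tilde{Z}_i\}$ are independent. It
follows from these properties and the Gaussianity of $\{\tilde{V}_i, \tilde{X}_i, \tilde{Y}_i, \tilde{Z}_i\}$ that there exists a sequence $\{b_k\}$
so that $\tilde{X}_i = \tilde{V}_i + \sum_k b_k \tilde{Z}_{i-k}$. Thus we have

$$\begin{aligned}
C_{FB,n} &\le \frac{1}{m} \left( h(Y_1^m(T) | T) - h(Z_1^m) \right) + \epsilon_m \\
&\le \frac{1}{m} \left( h(Y_1^m(T)) - h(Z_1^m) \right) + \epsilon_m \\
&\le \frac{1}{m} \left( h(\tilde{Y}_1^m) - h(Z_1^m) \right) + \epsilon_m.
\end{aligned}$$

By letting $m \to \infty$ and using the definition of $\tilde{C}_{FB}$, we obtain

$$C_{FB,n} \le h(\tilde{\mathcal{Y}}) - h(\mathcal{Z}) \le \tilde{C}_{FB}.$$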

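For the other direction of the inequality, we use the notation $\tilde{C}_{FB}(P)$ and $C_{FB,n}(P)$ to stress the
dependence of feedback capacity on the power constraint $P$. Given $\epsilon > 0$, let $\{X_i = V_i + \sum_{k=1}^{m} b_k Z_{i-k}\}_{i=-\infty}^{\infty}$
achieve $\tilde{C}_{FB}(P) - \epsilon$ under the power constraint $P$. The corresponding channel output is given as

$$Y_i = V_i + Z_i + \sum_{k=1}^{m} b_k Z_{i-k}, \qquad -\infty < i < \infty.$$

Now, we define a single-sided nonstationary process $\{\tilde{X}_i\}_{i=1}^{\infty}$ as

$$\tilde{X}_i = \begin{cases} U_i + V_i + \sum_{k=1}^{i-1} b_k Z_{i-k}, & i \le m \\ U_i + V_i + \sum_{k=1}^{m} b_k Z_{i-k}, & i > m \end{cases}$$
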
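where $U_1, U_2, \ldots$ are i.i.d. $\sim N(0, \epsilon)$, independent of $\{Z_i\}$ and $\{V_i\}$. Thus, $\tilde{X}_i$ depends causally on
$Z_1^{i-1}$ for all $i$. Let $\{\tilde{Y}_i\}_{i=1}^{\infty}$ be the corresponding channel output $\tilde{Y}_i = \tilde{X}_i + Z_i$. Since $E\tilde{X}_i^2 < \infty$ for
$i \le m$ and

$$E\tilde{X}_i^2 = E X_i^2 + E U_i^2 = P + \epsilon$$

for $i > m$, we have

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} E\tilde{X}_i^2 = P + \epsilon.$$

Also, since $h(\tilde{Y}_1^m | \tilde{Y}_{m+1}^n) \ge h(U_1^m | \tilde{Y}_{m+1}^n) = h(U_1^m) > -\infty$ and $\tilde{Y}_i = Y_i + U_i$ for $i > m$,

$$\lim_{n \to \infty} \frac{1}{n} h(\tilde{Y}_1^n) = \lim_{n \to \infty} \frac{1}{n} h(\tilde{Y}_{m+1}^n) \ge h(\mathcal{Y}) = h(\mathcal{Z}) + \tilde{C}_{FB}(P) - \epsilon.$$
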

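Consequently, for $n$ sufficiently large,

$$\frac{1}{n} \sum_{i=1}^{n} E\tilde{X}_i^2 \le P + 2\epsilon$$

and

$$\frac{1}{n} \left( h(\tilde{Y}_1^n) - h(Z_1^n) \right) \ge \tilde{C}_{FB}(P) - 2\epsilon.$$

Therefore, we can conclude that

$$C_{FB,n}(P + 2\epsilon) \ge \tilde{C}_{FB}(P) - 2\epsilon$$

for $n$ sufficiently large, whence

$$\liminf_{n \to \infty} C_{FB,n}(P + 2\epsilon) \ge \tilde{C}_{FB}(P) - 2\epsilon.$$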

But as a function of $P$, $C_{FB,n}(P)$ is concave on $[0, \infty)$ and hence continuous on $(0, \infty)$. (Recall
Proposition IV.1.) In fact, $C_{FB,n}(P)$ is continuous on $[0, \infty)$ because $C_n(P) \le C_{FB,n}(P) \le 2C_n(P)$
as shown by Cover and Pombra [13, Theorem 3] and $C_n(P) \to C_n(0) = 0$ as $P \to 0$. For the same
reason, $\liminf_{n \to \infty} C_{FB,n}(P)$ is also continuous in $P$. Hence, by taking $\epsilon \to 0$, we can get

$$\liminf_{n \to \infty} C_{FB,n}(P) \ge \tilde{C}_{FB},$$

which, combined with (46), implies that

$$\lim_{n \to \infty} C_{FB,n}(P) = \tilde{C}_{FB}(P).$$

Incidentally, we have proved that the limit of $C_{FB,n}$ exists, with no resort to the superadditivity of $C_{FB,n}$,
cf. O. □ 



V. Optimal Feedback Coding Scheme 

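In this section, we explore many features of the variational characterization of the Gaussian feedback
capacity we established in Theorem IV.1:

$$C_{FB} = \sup \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi}$$

with the supremum taken over all $S_V(e^{i\theta}) \ge 0$ and all strictly causal polynomials $B(e^{i\theta})$ satisfying the
power constraint

$$\int_{-\pi}^{\pi} \left( S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \right) \frac{d\theta}{2\pi} \le P.$$
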

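The ultimate goal is to obtain an explicit characterization of $C_{FB}$ as a function of $S_Z$ and $P$, or
equivalently, to solve the optimization problem

$$\begin{aligned}
\text{maximize} \quad & \int_{-\pi}^{\pi} \log\left( S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta}) \right) \frac{d\theta}{2\pi} \\
\text{subject to} \quad & S_V(e^{i\theta}) \ge 0 \\
& B(e^{i\theta}) = \sum_{k \ge 1} b_k e^{ik\theta} \ \text{strictly causal} \\
& \int_{-\pi}^{\pi} \left( S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \right) \frac{d\theta}{2\pi} \le P.
\end{aligned} \tag{53}$$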

Recall that the optimization problem (53) is equivalent to the maximization of the entropy rate of
the stationary process $\{Y_i = X_i + Z_i\}_{i=-\infty}^{\infty}$ over all stationary processes $\{X_i\}_{i=-\infty}^{\infty}$ of the form $X_i =
V_i + \sum_{k \ge 1} b_k Z_{i-k}$. Whenever necessary, our discussion will resort to the context of the stationary
processes and the corresponding entropy rate. We start by studying the properties of an optimal solution
$(S_V^\star(e^{i\theta}), B^\star(e^{i\theta}))$ to (53); cf. Proposition IV.2.

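Proposition V.1 (Necessary condition for an optimal $(S_V^\star, B^\star)$). An optimal solution $(S_V^\star(e^{i\theta}), B^\star(e^{i\theta}))$
to (53), if one exists, must satisfy all of the following conditions:

(54) Power: $\int_{-\pi}^{\pi} \left( S_V^\star(e^{i\theta}) + |B^\star(e^{i\theta})|^2 S_Z(e^{i\theta}) \right) \frac{d\theta}{2\pi} = P$.

(55) Water-filling: $S_V^\star(e^{i\theta})$ water-fills the modified noise spectrum $|1 + B^\star(e^{i\theta})|^2 S_Z(e^{i\theta})$, that is,

$$S_V^\star(e^{i\theta}) \left( S_Y^\star(e^{i\theta}) - \lambda^\star \right) = 0 \quad \text{a.e.}$$

where $S_Y^\star(e^{i\theta}) = S_V^\star(e^{i\theta}) + |1 + B^\star(e^{i\theta})|^2 S_Z(e^{i\theta})$ and $\lambda^\star = \operatorname{ess\,inf}_{\theta \in [-\pi, \pi)} S_Y^\star(e^{i\theta})$.

(56) Orthogonality: The current input $X_n$ is independent of the past output $\{Y_k\}_{k=-\infty}^{n-1}$. Equivalently,

$$S_V^\star(e^{i\theta}) + B^\star(e^{i\theta}) S_Z(e^{i\theta}) (1 + B^\star(e^{-i\theta}))$$

is anticausal.

Furthermore, if $S_Z(e^{i\theta})$ is bounded away from zero, i.e., $\operatorname{ess\,inf}_\theta S_Z(e^{i\theta}) > 0$, then there exist $S_V^\star(e^{i\theta}) \in
L_1$ and $B^\star(e^{i\theta}) = \sum_{k=1}^{\infty} b_k e^{ik\theta} \in H_2$ attaining the maximum of (53).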

Proof. Necessity of the first two conditions is obvious; since each fixed $B$ gives a nonfeedback channel
$|1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})$ with the input spectrum $S_V(e^{i\theta})$, the optimality conditions for the nonfeedback
capacity in Section III apply.

For the orthogonality condition (56), we modify a fixed-point method^1 for the finite-dimensional case
by Ordentlich [62]. Suppose $(S_V, B)$ is optimal and $S_V + B(1 + \overline{B}) S_Z$ is not anticausal. Then

$$\int_{-\pi}^{\pi} \left( S_V + B(1 + \overline{B}) S_Z \right) e^{-in\theta} \, \frac{d\theta}{2\pi} = \gamma \ne 0$$

for some $n \ge 1$. Let $A(z) = x z^n$ with $|x| < 1$. Then $(\tilde{S}_V, \tilde{B}) = (|1 + A|^2 S_V, (1 + A)(1 + B) - 1)$ is
another feasible solution to (53). Since the corresponding output spectrum is

$$\tilde{S}_Y = \tilde{S}_V + |1 + \tilde{B}|^2 S_Z = |1 + A|^2 S_V + |1 + A|^2 |1 + B|^2 S_Z = |1 + A|^2 S_Y,$$

the entropy rate stays the same for $\tilde{S}_Y$ by Jensen's formula (9). On the other hand, the power usage
becomes

$$\begin{aligned}
P(x) &= \int_{-\pi}^{\pi} \left( \tilde{S}_V + |\tilde{B}|^2 S_Z \right) \frac{d\theta}{2\pi} \\
&= \int_{-\pi}^{\pi} \left( |1 + A|^2 S_V + |A(1 + B) + B|^2 S_Z \right) \frac{d\theta}{2\pi} \\
&= \int_{-\pi}^{\pi} \left( S_V + |B|^2 S_Z \right) \frac{d\theta}{2\pi} + 2 \int_{-\pi}^{\pi} \overline{A} \left( S_V + B(1 + \overline{B}) S_Z \right) \frac{d\theta}{2\pi} + \int_{-\pi}^{\pi} |A|^2 \left( S_V + |1 + B|^2 S_Z \right) \frac{d\theta}{2\pi} \\
&= P + 2\gamma x + P_Y x^2
\end{aligned}$$

where $P_Y = \int_{-\pi}^{\pi} S_Y \frac{d\theta}{2\pi}$ is the original output power. Since $P(x)$ is quadratic in $x$ with the leading
coefficient $P_Y > 0$, we can choose $x$ small with appropriate sign so that $P(x) < P$. But this implies
that $(\tilde{S}_V, \tilde{B})$ achieves the same entropy rate as the original $(S_V, B)$ using strictly less power. This
contradicts the optimality of $(S_V, B)$ and hence we have the anticausality of $S_V + B(1 + \overline{B}) S_Z$.

^1 This material on the fixed-point characterization of the optimal $(K_V^\star, B^\star)$ was delivered with rigorous details at ISIT 1994
by Ordentlich, but it never appeared in the conference proceedings or other places.

The proof of the existence of the optimal $(S_V^\star, B^\star)$ is rather technical, so it will be given in the
Appendix. □
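
The perturbation step in the orthogonality argument is simple arithmetic on the quadratic $P(x) = P + 2\gamma x + P_Y x^2$, and can be sketched as follows. The function names and the clipping bound `cap` are illustrative assumptions; the proof only requires some small $x$ of the appropriate sign with $|x| < 1$.

```python
def perturbed_power(P, gamma, P_Y, x):
    """Power used by the perturbed pair: P(x) = P + 2*gamma*x + P_Y*x**2."""
    return P + 2.0 * gamma * x + P_Y * x * x


def power_saving_step(gamma, P_Y, cap=0.5):
    """Pick x with |x| <= cap < 1 and sign opposite to gamma so that P(x) < P.

    Assumes gamma != 0 and P_Y > 0. The unconstrained minimizer of the
    quadratic is -gamma / P_Y; clipping keeps the same sign, so the power
    still strictly decreases.
    """
    x = -gamma / P_Y
    return max(-cap, min(cap, x))
```

Any nonzero $\gamma$ therefore yields a strictly smaller power at the same entropy rate, which is the contradiction used in the proof.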

Unlike the finite-dimensional case, the conditions (54)-(56) are not necessarily sufficient; one can
easily construct a suboptimal $(S_V, B)$ satisfying the above conditions. Nonetheless, we can deduce
many interesting observations from them.

Corollary V.1. Feedback does not increase the capacity if and only if the noise spectrum is white, i.e.,
$S_Z(e^{i\theta})$ is constant.

Proof. Shannon's 1956 paper [81] shows that feedback does not increase the capacity for memoryless 
channels, taking care of the sufficiency. (See also Kadota, Zakai, and Ziv [39], [40].) 

For the necessity, we assume that Sz is bounded away from zero without loss of generality. Indeed, 
we can use a small amount of power to water-fill the spectrum first, then use the remaining power to 
code with or without feedback. If the stated claim is true, then feedback increases the capacity for the 
modified channel and hence for the original channel. (For the nonfeedback coding, there is no loss of 
optimality in dividing the power into two parts and water-filling successively.) 

Proceeding on to the proof of the necessity, suppose $S_X^\star(e^{i\theta})$ achieves the nonfeedback capacity and
hence $(S_V^\star(e^{i\theta}), B^\star(e^{i\theta})) = (S_X^\star(e^{i\theta}), 0)$ achieves the feedback capacity. Then, from the condition (56),

$$S_V^\star(e^{i\theta}) + B^\star(e^{i\theta}) S_Z(e^{i\theta}) (1 + B^\star(e^{-i\theta})) = S_X^\star(e^{i\theta})$$

is anticausal and hence is white. Therefore, $S_Z(e^{i\theta})$ must be also white. □

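Corollary V.2. Suppose $(S_V^\star(e^{i\theta}), B^\star(e^{i\theta}))$ attains the maximum of (53). Then, there exists $B^{\star\star}(e^{i\theta})$
such that

$$S_Y^\star(e^{i\theta}) = S_V^\star(e^{i\theta}) + |1 + B^\star(e^{i\theta})|^2 S_Z(e^{i\theta}) = |1 + B^{\star\star}(e^{i\theta})|^2 S_Z(e^{i\theta})$$

and

$$\int_{-\pi}^{\pi} \left( S_V^\star(e^{i\theta}) + |B^\star(e^{i\theta})|^2 S_Z(e^{i\theta}) \right) \frac{d\theta}{2\pi} = \int_{-\pi}^{\pi} |B^{\star\star}(e^{i\theta})|^2 S_Z(e^{i\theta}) \, \frac{d\theta}{2\pi}.$$

In particular, $(0, B^{\star\star}(e^{i\theta}))$ attains the maximum of (53).

In order to prove Corollary V.2, we need the following simple result, which essentially establishes
the optimality of the original Schalkwijk-Kailath coding scheme for the additive white Gaussian noise
channel.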

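Lemma V.1. Suppose the noise spectrum is white with $S_Z(e^{i\theta}) \equiv N$. Then, the choice of $S_V^\star(e^{i\theta}) \equiv 0$
and

$$B^\star(e^{i\theta}) = \frac{1 - a^{-1} e^{i\theta}}{1 - a e^{i\theta}} - 1$$

with $a = \sqrt{N/(P + N)}$ achieves the feedback capacity $C_{FB} = \frac{1}{2} \log\left(1 + \frac{P}{N}\right)$ under the power constraint
$P$. Furthermore, the resulting output spectrum is given by

$$S_Y^\star(e^{i\theta}) \equiv P + N.$$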
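Proof. We first check that

$$S_Y^\star(z) = N (1 + B^\star(z)) (1 + B^\star(z^{-1})) = N \cdot \frac{1 - a^{-1} z}{1 - a z} \cdot \frac{1 - a^{-1} z^{-1}}{1 - a z^{-1}} = N a^{-2} = P + N.$$

On the other hand, since

$$B^\star(z) = \frac{(a - a^{-1}) z}{1 - a z} = -\frac{1 - a^2}{a} \sum_{k=1}^{\infty} a^{k-1} z^k,$$

we have

$$\int_{-\pi}^{\pi} |B^\star(e^{i\theta})|^2 S_Z(e^{i\theta}) \, \frac{d\theta}{2\pi} = \frac{N (1 - a^2)^2}{a^2} \sum_{k=1}^{\infty} a^{2(k-1)} = \frac{N (1 - a^2)}{a^2} = P.$$

Clearly, we have achieved $C_{FB}(P) = \frac{1}{2} \log\left(1 + \frac{P}{N}\right)$. □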
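
The two identities in this proof are easy to confirm numerically. The sketch below samples the unit circle on a uniform grid (an assumed discretization, accurate here because the integrand is smooth and periodic), and the function name `check_lemma_white` is hypothetical. The rate is reported in nats, which assumes natural logarithms.

```python
import cmath
import math


def check_lemma_white(P, N, m=2048):
    """Check the Schalkwijk-Kailath filter for white noise of power N.

    Returns (worst_gap, power, rate), where worst_gap is the largest deviation
    of |1 + B|^2 * N from P + N on the grid, power approximates the integral
    of |B|^2 * N dtheta / (2 pi), and rate = 0.5 * log(1 + P / N) in nats.
    """
    a = math.sqrt(N / (P + N))
    worst_gap = 0.0
    power = 0.0
    for t in range(m):
        z = cmath.exp(2j * math.pi * t / m)
        B = (1.0 - z / a) / (1.0 - a * z) - 1.0
        worst_gap = max(worst_gap, abs(abs(1.0 + B) ** 2 * N - (P + N)))
        power += abs(B) ** 2 * N / m
    rate = 0.5 * math.log(1.0 + P / N)
    return worst_gap, power, rate
```

For example, with $P = 3$ and $N = 1$ we get $a = 1/2$, a flat output spectrum equal to 4, and a feedback power equal to 3.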

The choice of the feedback filter $B^\star(e^{i\theta})$ is far from unique; for example, we can use any causal filter
derived from the normalized Blaschke product as

$$B(z) = \prod_{k=1}^{\infty} \frac{1 - a_k^{-1} z^{j_k}}{1 - a_k z^{j_k}} - 1 \tag{57}$$

where $\{j_k\}_{k=1}^{\infty}$ is an arbitrary sequence of positive integers and $\{a_k\}_{k=1}^{\infty}$ is a sequence of real numbers
such that $|a_k| < 1$ for all $k$ and $\prod_{k=1}^{\infty} a_k^2 = N/(P + N)$. (We will prove the optimality of these feedback
filters later in the next section.) Note that there are filters that are not covered by the form (57), but still
achieve the capacity for white spectrum.

Now we move on to the proof of Corollary V.2.

Proof of Corollary V.2. Suppose

$$\int_{-\pi}^{\pi} S_V^\star(e^{i\theta}) \, \frac{d\theta}{2\pi} = P_1$$

and

$$\int_{-\pi}^{\pi} |B^\star(e^{i\theta})|^2 S_Z(e^{i\theta}) \, \frac{d\theta}{2\pi} = P - P_1.$$

We assume $P_1 > 0$; otherwise, there is nothing to prove.

We argue that $\tilde{S}(e^{i\theta}) := |1 + B^\star(e^{i\theta})|^2 S_Z(e^{i\theta})$ must be white. Assume the contrary and consider the
Gaussian feedback channel with the noise spectrum $\tilde{S}(e^{i\theta})$ under the power constraint $P_1$. But from
Corollary V.1, $(S_V^\star(e^{i\theta}), 0)$ is strictly dominated by some feedback coding scheme $(\tilde{S}_V(e^{i\theta}), \tilde{B}(e^{i\theta}))$
with nonzero $\tilde{B}(e^{i\theta})$. Hence, for the original channel, we have a two-stage strategy $(\tilde{S}_V(e^{i\theta}), (1 +
B^\star(e^{i\theta}))(1 + \tilde{B}(e^{i\theta})) - 1)$ with the corresponding output entropy higher than that of the original $S_Y^\star(e^{i\theta})$,
which contradicts the optimality of $(S_V^\star(e^{i\theta}), B^\star(e^{i\theta}))$.

Now suppose the white spectrum $\tilde{S}(e^{i\theta})$ has the power, say, $N_1$. From the water-filling condition (55),
$S_V^\star \equiv P_1$ and the resulting output spectrum is $S_Y^\star(e^{i\theta}) \equiv P_1 + N_1$. On the other hand, from Lemma V.1
we can achieve the feedback capacity $\frac{1}{2} \log(1 + \frac{P_1}{N_1})$ for the new channel $\tilde{S}(e^{i\theta})$ by using $\tilde{B}(e^{i\theta}) =
(1 - a^{-1} e^{i\theta})/(1 - a e^{i\theta}) - 1$, $a = \sqrt{N_1}/\sqrt{P_1 + N_1}$. Consequently, we can achieve the feedback capacity of
the original channel $S_Z(e^{i\theta})$ through a two-stage strategy: first transform the channel into $\tilde{S}(e^{i\theta})$ using
$B^\star(e^{i\theta})$, and then use $\tilde{B}(e^{i\theta})$ for the channel $\tilde{S}(e^{i\theta})$. The corresponding one-stage filter is given by

$$B^{\star\star}(e^{i\theta}) = (1 + B^\star(e^{i\theta}))(1 + \tilde{B}(e^{i\theta})) - 1$$

and $(0, B^{\star\star}(e^{i\theta}))$ achieves the feedback capacity with the same output spectrum $S_Y^\star(e^{i\theta})$. □

Remark V.1. We can make a somewhat stronger statement: if $S_Z$ is nonwhite, then $S_V^\star$ must be zero.
To see this, first note from the above proof that, if $S_V^\star$ is nonzero, then $|1 + B^\star|^2 S_Z$ and $S_V^\star$, as well as
$S_Y^\star$, should be white. Now from the orthogonality condition (56), $S_V^\star + B^\star S_Z (1 + \overline{B^\star})$ is anticausal, or
equivalently, $S_Z (1 + \overline{B^\star})$ is anticausal, which is true only if $S_Z$ is white.

The essential content of Corollary V.2 is that we can restrict attention to the solutions of the form
$(0, B(e^{i\theta}))$, even in the case the supremum in (53) is not attainable. Indeed, we can easily modify the
proof of Corollary V.2 to show that for any solution $(S_V, B)$, there exists another solution $(0, \tilde{B})$ such
that the corresponding output entropy rate is no less than the original under the same power usage. This
observation yields a simpler characterization of the feedback capacity.

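Theorem V.1. Suppose that the stationary Gaussian noise process $\{Z_i\}_{i=1}^{\infty}$ has the absolutely continuous
power spectral distribution $d\mu_Z(\theta) = S_Z(e^{i\theta})\,d\theta$. Then, the feedback capacity $C_{FB}$ of the Gaussian
channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, under the power constraint $P$, is given by

$$C_{FB} = \sup_{B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{|1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi} = \sup_{B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log |1 + B(e^{i\theta})|^2 \, \frac{d\theta}{2\pi} \tag{58}$$

where the supremum is taken over all strictly causal polynomials $B(e^{i\theta}) = \sum_{k=1}^{m} b_k e^{ik\theta}$ satisfying the
power constraint $\frac{1}{2\pi} \int_{-\pi}^{\pi} |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \, d\theta \le P$.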
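
To make the objective and the constraint in this variational formula concrete, the sketch below evaluates both for a given strictly causal polynomial on a uniform grid of the unit circle. The function name `rate_and_power`, the grid size, and the use of natural logarithms (rates in nats) are assumptions made for illustration; `S_Z` is passed as a function of $\theta$.

```python
import cmath
import math


def rate_and_power(b, S_Z, m=2048):
    """Evaluate the objective and power of B(z) = b[0]*z + b[1]*z**2 + ...

    Returns (rate, power), where rate approximates the integral of
    0.5 * log|1 + B|^2 dtheta / (2 pi) and power approximates the integral
    of |B|^2 * S_Z dtheta / (2 pi). Assumes 1 + B has no zero on the circle.
    """
    rate = 0.0
    power = 0.0
    for t in range(m):
        theta = 2.0 * math.pi * t / m
        z = cmath.exp(1j * theta)
        B = sum(b_k * z ** (k + 1) for k, b_k in enumerate(b))
        rate += 0.5 * math.log(abs(1.0 + B) ** 2)
        power += abs(B) ** 2 * S_Z(theta)
    return rate / m, power / m
```

For white noise with $S_Z \equiv 1$, the one-tap filter $B(z) = 2z$ gives rate $\log 2$ at power 4, which is below the capacity $\frac{1}{2} \log 5$ at that power, while $B(z) = z/2$ gives rate zero because $1 + B$ has no zero inside the unit disk.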

Although Proposition V.1 and its corollaries reveal the structure of the capacity-achieving feedback
filter, they still fall short of characterizing the capacity-achieving feedback filter itself. For example, we can
show that there is more than one feedback filter $B$ satisfying the orthogonality condition. We remedy
the situation by deriving a universal upper bound on the feedback capacity and finding the condition
under which this upper bound is tight.

We begin with a program similar to the one at the end of Section III. We will assume that $S_Z(e^{i\theta})$
is bounded away from zero, which does not incur much loss of generality, for we can always perturb
the noise spectrum with little power without changing the output entropy rate by much. (Also recall
the condition for existence of an optimal solution in Proposition V.1.) From the canonical spectral
factorization theorem, we write $S_Z(e^{i\theta}) = H_Z(e^{i\theta}) H_Z(e^{-i\theta})$ with $H_Z \in H_2$. Since $S_Z(e^{i\theta})$ is bounded
away from zero, $1/H_Z \in H_\infty$.

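Under the change of variable $S_Y(e^{i\theta}) = S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})$, we rewrite the optimization
problem (53) as

$$\begin{aligned}
\text{maximize} \quad & \int_{-\pi}^{\pi} \log S_Y(e^{i\theta}) \, \frac{d\theta}{2\pi} \\
\text{subject to} \quad & S_Y(e^{i\theta}) \ge |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta}) \\
& \int_{-\pi}^{\pi} \left( S_Y(e^{i\theta}) - (B(e^{i\theta}) + B(e^{-i\theta}) + 1) S_Z(e^{i\theta}) \right) \frac{d\theta}{2\pi} \le P \\
& B(e^{i\theta}) \in H_2 \ \text{strictly causal.}
\end{aligned} \tag{59}$$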
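Take any $\nu > 0$, $\phi, \psi_1 \in L_\infty$, and $\psi_2, \psi_3 \in L_1$ such that $\phi > 0$, $\log \phi \in L_1$, $\psi_2 H_Z^{-1} \in L_2$, $\psi_1 =
\nu - \phi \ge 0$, $A := \psi_2 + \nu S_Z$ is anticausal, and

$$\begin{bmatrix} \psi_1(e^{i\theta}) & \psi_2(e^{i\theta}) \\ \overline{\psi_2(e^{i\theta})} & \psi_3(e^{i\theta}) \end{bmatrix} \succeq 0.$$

Now that any feasible $B(e^{i\theta})$ and $S_Y(e^{i\theta})$ satisfy

$$\begin{bmatrix} S_Y(e^{i\theta}) & 1 + B(e^{i\theta}) \\ 1 + \overline{B(e^{i\theta})} & S_Z^{-1}(e^{i\theta}) \end{bmatrix} \succeq 0,$$

we have from Lemma II.8 that

$$\mathrm{tr}\left( \begin{bmatrix} S_Y & 1 + B \\ 1 + \overline{B} & S_Z^{-1} \end{bmatrix} \begin{bmatrix} \psi_1 & \psi_2 \\ \overline{\psi_2} & \psi_3 \end{bmatrix} \right) = \psi_1 S_Y + \psi_2 (1 + \overline{B}) + \overline{\psi_2} (1 + B) + \psi_3 S_Z^{-1} \ge 0.$$
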

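Proceeding as in the nonfeedback case, we invoke the inequality $\log x \le x - 1$ for all $x \ge 0$ with
$x = \phi S_Y$ to get

$$\begin{aligned}
\log S_Y &\le -\log \phi + \phi S_Y - 1 \\
&= -\log \phi + \nu S_Y - \psi_1 S_Y - 1 \\
&\le -\log \phi + \nu S_Y + \psi_2 (1 + \overline{B}) + \overline{\psi_2} (1 + B) + \psi_3 S_Z^{-1} - 1.
\end{aligned} \tag{60}$$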

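Furthermore, since $A = \psi_2 + \nu S_Z$ is anticausal and $B$ is strictly causal, $A \overline{B} \in L_1$ is strictly anticausal;
recall Lemma II.3. (Indeed, $|\psi_2 \overline{B}| = |\psi_2 H_Z^{-1}| \cdot |H_Z B| \in L_1$ since the first factor is in $L_2$ while the second
factor is the modulus of an $H_2$ function.) Hence

$$\int_{-\pi}^{\pi} A(e^{i\theta}) \overline{B(e^{i\theta})} \, \frac{d\theta}{2\pi} = \int_{-\pi}^{\pi} \overline{A(e^{i\theta})} B(e^{i\theta}) \, \frac{d\theta}{2\pi} = 0. \tag{61}$$

By integrating both sides of (60), we get

$$\begin{aligned}
\int_{-\pi}^{\pi} \log S_Y \, \frac{d\theta}{2\pi}
&\le \int_{-\pi}^{\pi} \left( -\log \phi + \nu S_Y + \psi_2 (1 + \overline{B}) + \overline{\psi_2} (1 + B) + \psi_3 S_Z^{-1} - 1 \right) \frac{d\theta}{2\pi} \\
&\le \int_{-\pi}^{\pi} \left( -\log \phi + \nu ((B + \overline{B} + 1) S_Z + P) + \psi_2 (1 + \overline{B}) + \overline{\psi_2} (1 + B) + \psi_3 S_Z^{-1} - 1 \right) \frac{d\theta}{2\pi} \\
&= \int_{-\pi}^{\pi} \left( -\log \phi + \psi_2 + \overline{\psi_2} + \psi_3 S_Z^{-1} + \nu (S_Z + P) - 1 + A \overline{B} + \overline{A} B \right) \frac{d\theta}{2\pi} \\
&= \int_{-\pi}^{\pi} \left( -\log \phi + \psi_2 + \overline{\psi_2} + \psi_3 S_Z^{-1} + \nu (S_Z + P) - 1 \right) \frac{d\theta}{2\pi}
\end{aligned}$$

where the second inequality follows from the power constraint in (59) and the last equality follows from
(61).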
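In summary, we have derived a general upper bound on the feedback capacity:

Proposition V.2. Suppose the noise power spectral density $S_Z(e^{i\theta})$ is bounded away from zero and has
the canonical spectral factorization $S_Z(e^{i\theta}) = |H_Z(e^{i\theta})|^2$. Then, the feedback capacity $C_{FB}$ under the
power constraint $P$ is upper bounded by

$$C_{FB} \le \frac{1}{2} \int_{-\pi}^{\pi} \left( -\log\left( \phi(e^{i\theta}) S_Z(e^{i\theta}) \right) + \psi_2(e^{i\theta}) + \overline{\psi_2(e^{i\theta})} + \frac{\psi_3(e^{i\theta})}{S_Z(e^{i\theta})} + \nu (S_Z(e^{i\theta}) + P) - 1 \right) \frac{d\theta}{2\pi} \tag{62}$$

for any $\nu > 0$, $\phi, \psi_1 \in L_\infty$, and $\psi_2, \psi_3 \in L_1$ such that

$$\begin{aligned}
& \phi > 0 \\
& \log \phi \in L_1 \\
& \psi_1 = \nu - \phi \ge 0 \\
& \psi_2 + \nu S_Z \ \text{is anticausal}
\end{aligned}$$

and

$$\begin{bmatrix} \psi_1(e^{i\theta}) & \psi_2(e^{i\theta}) \\ \overline{\psi_2(e^{i\theta})} & \psi_3(e^{i\theta}) \end{bmatrix} \succeq 0.$$
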

As before, for us, the major utility of this upper bound lies in the characterization of the optimal
solution $B^\star(e^{i\theta})$. Tracing the equality conditions in (62), we can establish the following sufficient
condition for the optimality of a specific feedback filter $B(e^{i\theta})$.

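Proposition V.3. Suppose $S_Z(e^{i\theta})$ is bounded away from zero. Suppose $B(z) \in H_2$ is strictly causal
(i.e., $B(0) = 0$) with

$$\int_{-\pi}^{\pi} |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \, \frac{d\theta}{2\pi} = P. \tag{63}$$

If there exists $\lambda > 0$ such that

$$\lambda \le \operatorname{ess\,inf}_{\theta \in [-\pi, \pi)} |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta}) \tag{64}$$

and that

$$\frac{\lambda}{1 + B(e^{-i\theta})} - B(e^{i\theta}) S_Z(e^{i\theta}) \ \text{is anticausal,} \tag{65}$$

then $B(e^{i\theta})$ achieves the feedback capacity; that is, $B(e^{i\theta})$ achieves the maximum of

$$C_{FB} = \max_{B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log |1 + B(e^{i\theta})|^2 \, \frac{d\theta}{2\pi}$$

over all strictly causal $B(e^{i\theta})$ satisfying $\int_{-\pi}^{\pi} |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \, \frac{d\theta}{2\pi} \le P$.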
Proof. Let $S_Y = |1 + B|^2 S_Z \ge \lambda > 0$. Let

$$\begin{aligned}
\nu &= 1/\lambda > 0 \\
\phi &= S_Y^{-1} \in L_\infty \qquad (66) \\
\psi_1 &= \nu - \phi \\
\psi_2 &= -\psi_1 (1 + B) S_Z \in L_1
\end{aligned}$$

and

$$\psi_3 = \psi_1 |1 + B|^2 S_Z^2 \in L_1.$$

It is straightforward to verify that $\nu$, $\phi$, $\psi_1$, $\psi_2$, and $\psi_3$ defined above satisfy the conditions set forth for
the upper bound (62). Moreover, from the condition (65),

$$\psi_2(e^{i\theta}) + \nu S_Z(e^{i\theta}) = -\psi_1(e^{i\theta}) (1 + B(e^{i\theta})) S_Z(e^{i\theta}) + \nu S_Z(e^{i\theta}) = \nu \left( \frac{\lambda}{1 + B(e^{-i\theta})} - B(e^{i\theta}) S_Z(e^{i\theta}) \right)$$

is anticausal.

Now, it is easy to check that

$$\psi_1 S_Y + \psi_2 (1 + \overline{B}) = \overline{\psi_2} (1 + B) + \psi_3 S_Z^{-1} = 0,$$

which makes the second inequality of (60) an equality. On the other hand, (66) makes the first inequality
of (60) an equality while the condition (63) makes the second inequality in the integrated bound an equality. Combining
these three equality conditions, we have the equality in (62), and hence the optimality of $B(e^{i\theta})$. □

Note that the causality condition (65) in the above proposition implies the orthogonality condition (56)
in Proposition V.1; for, if $\lambda/(1 + B(e^{-i\theta})) - B(e^{i\theta}) S_Z(e^{i\theta})$ is anticausal, $B(e^{i\theta}) S_Z(e^{i\theta}) (1 + B(e^{-i\theta}))$ is
anticausal. The converse is not necessarily true and hence there is a nontrivial gap between the necessary
conditions in Proposition V.1 and the sufficient condition in Proposition V.3.

Although the conditions (63)-(65) give a characterization of the optimal feedback filter, this
characterization is rather implicit and still falls short of yielding what can be called a closed-form solution for the
feedback capacity problem (58). In the next two sections, we find more explicit answers by narrowing
attention to special classes of noise spectra.



VI. First-order ARMA noise spectrum 

As a gentle start, we consider the zeroth-order autoregressive moving average (= white) noise spectrum
first. Since the spectrum is bounded away from zero, from Theorem V.1 and Proposition V.3 the feedback
capacity is characterized in the following variational formula:

$$C_{FB} = \max_{B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log |1 + B(e^{i\theta})|^2 \, \frac{d\theta}{2\pi}$$

where the maximum is taken over all strictly causal filters $B(e^{i\theta}) = \sum_{k=1}^{\infty} b_k e^{ik\theta}$ satisfying the power
constraint $\frac{1}{2\pi} \int_{-\pi}^{\pi} |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \, d\theta \le P$. In Lemma V.1 we already proved that

$$B(z) = \frac{1 - a^{-1} z}{1 - a z} - 1, \qquad -1 < a < 1$$

achieves the feedback capacity for the noise spectrum $S_Z \equiv N$ under the power constraint $P = N(1/a^2 -
1)$. Here we establish the optimality of filters of the form

$$B(z) = \prod_{k=1}^{\infty} \frac{1 - a_k^{-1} z^{j_k}}{1 - a_k z^{j_k}} - 1 \tag{67}$$

where $\{j_k\}_{k=1}^{\infty}$ is an arbitrary sequence of positive integers and $\{a_k\}_{k=1}^{\infty}$ is a sequence of real numbers
such that $|a_k| < 1$ for all $k$ and $\prod_{k=1}^{\infty} a_k^2 = N/(P + N)$.

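Although we can employ direct brute-force calculation similar to the proof of Lemma V.1, we apply
Proposition V.3 as an elegant alternative. Observe that the resulting output spectrum is

$$S_Y(z) = N \prod_{k=1}^{\infty} \frac{1 - a_k^{-1} z^{j_k}}{1 - a_k z^{j_k}} \cdot \prod_{k=1}^{\infty} \frac{1 - a_k^{-1} z^{-j_k}}{1 - a_k z^{-j_k}} = N \prod_{k=1}^{\infty} a_k^{-2} = P + N.$$

Take $\lambda = \min_{|z|=1} S_Y(z) = P + N = N \prod_{k=1}^{\infty} (1/a_k^2)$. Then,

$$\begin{aligned}
\frac{\lambda}{1 + B(z^{-1})} - B(z) S_Z(z)
&= (P + N) \prod_{k=1}^{\infty} \frac{1 - a_k z^{-j_k}}{1 - a_k^{-1} z^{-j_k}} - N \left( \prod_{k=1}^{\infty} \frac{1 - a_k^{-1} z^{j_k}}{1 - a_k z^{j_k}} - 1 \right) \\
&= N (1 + B(z)) - N B(z) \\
&= N.
\end{aligned}$$

Hence, the feedback filter $B(z)$ given in (67) satisfies the sufficient condition in Proposition V.3, which
confirms its optimality.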
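
The constancy of the expression in the causality condition can be checked numerically for a finite product. The sketch below assumes a two-factor example (the parameter lists `a` and `exps` are illustrative choices, not values from the paper) and reports the largest deviation of $\lambda/(1 + B(z^{-1})) - B(z) S_Z(z)$ from $N$ on a grid of the unit circle.

```python
import cmath
import math


def one_plus_B(z, a, exps):
    """Finite Blaschke-type product prod_k (1 - z**j_k / a_k) / (1 - a_k * z**j_k)."""
    prod = 1.0 + 0.0j
    for a_k, j_k in zip(a, exps):
        prod *= (1.0 - z ** j_k / a_k) / (1.0 - a_k * z ** j_k)
    return prod


def max_deviation_from_N(P, N, a, exps, m=512):
    """Largest |lam / (1 + B(1/z)) - B(z) * N - N| over m points with |z| = 1.

    Assumes white noise S_Z = N, lam = P + N, and prod_k a_k**2 = N / (P + N).
    """
    lam = P + N
    worst = 0.0
    for t in range(m):
        z = cmath.exp(2j * math.pi * t / m)
        B_z = one_plus_B(z, a, exps) - 1.0
        value = lam / one_plus_B(1.0 / z, a, exps) - B_z * N
        worst = max(worst, abs(value - N))
    return worst
```

With $P = 3$, $N = 1$, $a = (0.8, 0.625)$ and exponents $(1, 2)$, the product of the $a_k^2$ equals $1/4 = N/(P+N)$ and the deviation is at the level of rounding error.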

Now we turn our attention to the first-order autoregressive moving average noise spectrum $S_Z(e^{i\theta})$,
defined by

$$S_Z(e^{i\theta}) = \left| \frac{1 + \alpha e^{i\theta}}{1 + \beta e^{i\theta}} \right|^2 \tag{68}$$

for $\alpha \in [-1, 1]$ and $\beta \in (-1, 1)$. (The case $|\alpha| > 1$ can be taken care of by the canonical spectral
factorization and proper scaling.) This spectral density corresponds to the stationary noise process given
by

$$Z_i + \beta Z_{i-1} = U_i + \alpha U_{i-1}, \qquad i \in \mathbb{Z},$$

where $\{U_i\}_{i=-\infty}^{\infty}$ is a white Gaussian process with zero mean and unit variance.




Theorem VI.1. Suppose the noise process $\{Z_i\}_{i=1}^{\infty}$ has the power spectral density $S_Z(e^{i\theta})$ defined in
(68). Then, the feedback capacity $C_{FB}$ of the Gaussian channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, under the
power constraint $P$, is given by

$$C_{FB} = -\log x_0$$

where $x_0$ is the unique positive root of the fourth-order polynomial

$$P x^2 = \frac{(1 - x^2)(1 + \sigma \alpha x)^2}{(1 + \sigma \beta x)^2} \tag{69}$$

and

$$\sigma = \mathrm{sgn}(\beta - \alpha) = \begin{cases} 1, & \beta > \alpha \\ 0, & \beta = \alpha \\ -1, & \beta < \alpha. \end{cases}$$
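
The closed form in this theorem is straightforward to evaluate. The following sketch finds the root of (69) in $(0, 1)$ by bisection and returns $-\log x_0$. The function name `feedback_capacity_arma1`, the tolerance, and the use of natural logarithms (capacity in nats) are assumptions; the bisection relies on the polynomial being negative at $x = 0$ and positive at $x = 1$, and on the uniqueness of the positive root stated in the theorem.

```python
import math


def feedback_capacity_arma1(alpha, beta, P, tol=1e-13):
    """Return (x0, C_FB) for the ARMA(1) spectrum |1 + alpha z|^2 / |1 + beta z|^2.

    Assumes |alpha| <= 1, |beta| < 1 and P > 0. x0 solves
    P x^2 (1 + sigma beta x)^2 = (1 - x^2)(1 + sigma alpha x)^2 on (0, 1).
    """
    sigma = (beta > alpha) - (beta < alpha)  # sgn(beta - alpha)

    def g(x):
        return (P * x * x * (1.0 + sigma * beta * x) ** 2
                - (1.0 - x * x) * (1.0 + sigma * alpha * x) ** 2)

    lo, hi = 0.0, 1.0  # g(0) = -1 < 0 and g(1) = P (1 + sigma beta)^2 > 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    x0 = 0.5 * (lo + hi)
    return x0, -math.log(x0)
```

In the white case $\alpha = \beta$ the equation reduces to $P x^2 = 1 - x^2$, so the sketch recovers $C_{FB} = \frac{1}{2} \log(1 + P)$.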



Proof. Without loss of generality, we assume $|\alpha| < 1$; for the case $|\alpha| = 1$, we can perturb the noise
spectrum with small power to transform it into another ARMA(1) spectrum with $|\alpha| < 1$. Under the
assumption $|\alpha| < 1$, $S_Z(e^{i\theta})$ is bounded away from zero, so we can apply Proposition V.3.

Here is the bare-bones summary of the proof: We will take the feedback filter of the form

$$B(z) = \frac{1 + \beta z}{1 + \alpha z} \cdot \frac{y z}{1 - \sigma x z} \tag{70}$$

where $x \in (0, 1)$ is an arbitrary parameter corresponding to each power constraint $P \in (0, \infty)$ under
the choice of

$$y = \frac{x^2 - 1}{\sigma x} \cdot \frac{1 + \sigma \alpha x}{1 + \sigma \beta x} = -P \sigma x \left( \frac{1 + \sigma \beta x}{1 + \sigma \alpha x} \right). \tag{71}$$

Then, we can show that $B(z)$ satisfies the sufficient condition in Proposition V.3 under the power
constraint

$$P = \int_{-\pi}^{\pi} |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \, \frac{d\theta}{2\pi} = \int_{-\pi}^{\pi} \frac{y^2}{|1 - \sigma x e^{i\theta}|^2} \, \frac{d\theta}{2\pi} = \frac{y^2}{1 - x^2} = \frac{1 - x^2}{x^2} \left( \frac{1 + \sigma \alpha x}{1 + \sigma \beta x} \right)^2$$

with the corresponding information rate given by

$$\int_{-\pi}^{\pi} \frac{1}{2} \log |1 + B(e^{i\theta})|^2 \, \frac{d\theta}{2\pi} = -\frac{1}{2} \log x^2.$$

The rest of the proof is the actual implementation of this idea.
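
Before going through the algebra, the two claims of this summary (power $y^2/(1 - x^2)$ and rate $-\log x$) can be checked by direct numerical integration. The sketch below is only a sanity check under stated assumptions: it takes $\alpha \ne \beta$, uses a uniform grid on the unit circle, and measures the rate in nats; the function name `arma1_filter_stats` is hypothetical.

```python
import cmath
import math


def arma1_filter_stats(alpha, beta, x, m=4096):
    """Numerically integrate power and rate of the first-order feedback filter.

    B(z) = (1 + beta z) / (1 + alpha z) * y z / (1 - sigma x z), with y chosen
    as (x^2 - 1) / (sigma x) * (1 + sigma alpha x) / (1 + sigma beta x).
    Assumes alpha != beta, |alpha| < 1, |beta| < 1 and 0 < x < 1.
    Returns (y, power, rate).
    """
    sigma = 1.0 if beta > alpha else -1.0
    y = ((x * x - 1.0) / (sigma * x)
         * (1.0 + sigma * alpha * x) / (1.0 + sigma * beta * x))
    power = 0.0
    rate = 0.0
    for t in range(m):
        z = cmath.exp(2j * math.pi * t / m)
        B = (1.0 + beta * z) / (1.0 + alpha * z) * (y * z) / (1.0 - sigma * x * z)
        S_Z = abs(1.0 + alpha * z) ** 2 / abs(1.0 + beta * z) ** 2
        power += abs(B) ** 2 * S_Z
        rate += 0.5 * math.log(abs(1.0 + B) ** 2)
    return y, power / m, rate / m
```

For instance, $\alpha = 0.2$, $\beta = 0.6$, $x = 0.5$ gives a rate of $\log 2$ nats at power $y^2/0.75 \approx 2.148$.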

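Assume $-1 < \alpha < \beta < 1$. Given $x \in (0, 1)$, take $y$ as in (71). Then, we can factor $1 + B(z)$ as

$$\begin{aligned}
1 + B(z) &= 1 + \frac{1 + \beta z}{1 + \alpha z} \cdot \frac{y z}{1 - x z} \\
&= \frac{1 + (\alpha - x + y) z + (\beta y - \alpha x) z^2}{(1 + \alpha z)(1 - x z)} \\
&= \frac{(1 + (\alpha x - \beta y) x z)(1 - x^{-1} z)}{(1 + \alpha z)(1 - x z)} \\
&= \frac{(1 - r z)(1 - x^{-1} z)}{(1 + \alpha z)(1 - x z)}
\end{aligned}$$

where $r = -(\alpha x - \beta y) x$. The corresponding output spectrum is given by

$$\begin{aligned}
S_Y(z) &= (1 + B(z)) S_Z(z) (1 + B(z^{-1})) \\
&= \frac{(1 - r z)(1 - x^{-1} z)}{(1 + \alpha z)(1 - x z)} \cdot \frac{(1 + \alpha z)(1 + \alpha z^{-1})}{(1 + \beta z)(1 + \beta z^{-1})} \cdot \frac{(1 - r z^{-1})(1 - x^{-1} z^{-1})}{(1 + \alpha z^{-1})(1 - x z^{-1})} \\
&= \frac{1}{x^2} \cdot \frac{(1 - r z)(1 - r z^{-1})}{(1 + \beta z)(1 + \beta z^{-1})}.
\end{aligned} \tag{72}$$

We first check that $|r| < 1$. Indeed, from (71), we can express $r = r(x)$ as

$$r(x) = \frac{(\beta - \alpha) x^2 - \alpha \beta x - \beta}{1 + \beta x} \tag{73}$$

$$\phantom{r(x)} = -\beta + (\beta - \alpha) \frac{\beta + x}{\beta + x^{-1}}. \tag{74}$$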
When $\beta \ge 0$, $0 < (\beta + x)/(\beta + x^{-1}) < 1$ (recall $0 < x < 1$) so that $-r$ is a convex combination of
$\alpha$ and $\beta$; hence $-1 < r < 1$. When $\beta < 0$, we differentiate (73) to find that $r'(0) < 0$, $r'(1) > 0$,
$\max_{x \in [0,1]} r(x) = \max\{r(0), r(1)\}$, and that there exists a unique $x^* \in (0, 1)$ attaining the minimum of
$r(x)$ on $[0, 1]$. Since $r(0) = -\beta$ and $r(1) = -\alpha$, it suffices to check that $r(x^*) > -1$. We have

$$\begin{aligned}
0 &= \left( \frac{d}{dx} [(\beta - \alpha) x^2 - \alpha \beta x - \beta] \right) (1 + \beta x) - ((\beta - \alpha) x^2 - \alpha \beta x - \beta) \frac{d}{dx} (1 + \beta x) \\
&= (2 (\beta - \alpha) x - \alpha \beta)(1 + \beta x) - \beta ((\beta - \alpha) x^2 - \alpha \beta x - \beta) \\
&= (\beta - \alpha)(\beta x^2 + 2x + \beta)
\end{aligned}$$

at $x = x^*$, whence

$$r(x^*) = \frac{2 (\beta - \alpha) x^* - \alpha \beta}{\beta} = 2 x^* + \alpha (x^*)^2 > 0.$$

Therefore, $|r| < 1$.
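
As a quick cross-check of this step, the sketch below evaluates the two expressions for $r(x)$ on a coarse parameter grid and records the largest discrepancy between them together with the largest $|r|$. The grid, the function names, and the restriction to $-0.9 \le \alpha < \beta \le 0.9$ are assumptions made for illustration; a grid scan is of course not a proof.

```python
def r_rational(alpha, beta, x):
    """r(x) as a single rational function of x."""
    return ((beta - alpha) * x * x - alpha * beta * x - beta) / (1.0 + beta * x)


def r_shifted(alpha, beta, x):
    """Equivalent form: -beta plus a multiple of (beta - alpha)."""
    return -beta + (beta - alpha) * (beta + x) / (beta + 1.0 / x)


def scan_r(steps=9, x_steps=20):
    """Scan alpha < beta on a grid in [-0.9, 0.9] and x in (0, 1).

    Returns (largest gap between the two forms, largest |r|).
    """
    worst_gap = 0.0
    worst_abs = 0.0
    for i in range(-steps, steps + 1):
        for j in range(i + 1, steps + 1):
            alpha, beta = i / 10.0, j / 10.0
            for k in range(1, x_steps):
                x = k / float(x_steps)
                r1 = r_rational(alpha, beta, x)
                r2 = r_shifted(alpha, beta, x)
                worst_gap = max(worst_gap, abs(r1 - r2))
                worst_abs = max(worst_abs, abs(r1))
    return worst_gap, worst_abs
```

On this grid the two forms agree to rounding error and $|r|$ stays below one, in line with the argument above.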
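Now let

$$\lambda = S_Y(x) = \frac{1}{x^2} \left( \frac{(1 - r x)(1 - r x^{-1})}{(1 + \beta x)(1 + \beta x^{-1})} \right) = \frac{1}{x^2} \left( \frac{(1 - r x)(x - r)}{(1 + \beta x)(x + \beta)} \right).$$

We will show that

$$0 < \lambda \le \min_{\theta \in [-\pi, \pi)} S_Y(e^{i\theta}).$$

For the positivity of $\lambda$, it suffices to show that $(x - r)/(x + \beta)$ is positive. From (74), we have

$$x - r = (x + \beta) \left( 1 - \frac{\beta - \alpha}{\beta + x^{-1}} \right) = (x + \beta) \left( \frac{\alpha + x^{-1}}{\beta + x^{-1}} \right) \tag{75}$$

so that $(x - r)/(x + \beta)$ is positive. (The case $x + \beta = 0$ is trivial since $r(-\beta) = -\beta$.)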

_ (1 + r^) - 2ru 

for — oo < ?i < oo. Then, we can express 
for G [— 7r,7r) and similarly express 



Sy{x) = f 



X + x ^ 



Since the linear fractional function f{u) does not have a singularity in [—1, 1], the minimum occurs at 
one of the end points and 

minSy(e*^) =min/(cos0) = min{/(l), /(-I)}. 
6 

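We consider different cases.

Case 1: $\beta \ge 0$. Then, $f(u)$ is decreasing on $(-(\beta + \beta^{-1})/2, +\infty)$ since

$$f'(u) = -\frac{2 (\beta + r)(1 + \beta r)}{x^2 ((1 + \beta^2) + 2 \beta u)^2}$$

and

$$\beta + r = (\beta - \alpha) \frac{\beta + x}{\beta + x^{-1}}$$

is positive. (Recall the standing assumption $\beta - \alpha > 0$.) By Jensen's inequality,

$$\frac{x + x^{-1}}{2} \ge 1$$

so that

$$f(-1) \ge f(1) \ge f\left( \frac{x + x^{-1}}{2} \right) = \lambda.$$

Case 2: $0 < -\beta < x$. Same as the previous case since $\beta + r$ is positive.

Case 3: $0 < -\beta = x$. As we saw before, $r = -\beta$ so that $f(u)$ is constant for all $u$.

Case 4: $0 < x < -\beta$. Since $f'(u) > 0$ with a singularity at $-(\beta + \beta^{-1})/2 > 1$ and

$$\frac{x + x^{-1}}{2} > -\frac{\beta + \beta^{-1}}{2},$$

we have

$$f\left( \frac{x + x^{-1}}{2} \right) \le \inf\{f(u) : u < -(\beta + \beta^{-1})/2\} \le \min\{f(1), f(-1)\}.$$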



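Therefore, $B(e^{i\theta})$ and $\lambda$ satisfy the condition (64) in Proposition V.3.

Finally we check the condition (65), namely, anticausality of

$$\begin{aligned}
\frac{\lambda}{1 + B(z^{-1})} - B(z) S_Z(z)
&= \lambda \left( \frac{(1 + \alpha z^{-1})(1 - x z^{-1})}{(1 - r z^{-1})(1 - x^{-1} z^{-1})} \right) - \frac{(1 + \alpha z^{-1})(y z)}{(1 + \beta z^{-1})(1 - x z)} \\
&= (1 + \alpha z^{-1}) \cdot \frac{\lambda (1 + \beta z^{-1})(x^2 - x z) - (y z)(1 - r z^{-1})}{(1 - r z^{-1})(1 + \beta z^{-1})(1 - x z)}.
\end{aligned} \tag{76}$$

From (71) and (75), we have

$$\begin{aligned}
\lambda (1 + \beta x)(x^2 - 1) - (y x^{-1})(1 - r x)
&= \frac{1 - r x}{x^2} \left( \frac{(x - r)(x^2 - 1)}{x + \beta} - \frac{(x^2 - 1)(1 + \alpha x)}{1 + \beta x} \right) \\
&= \frac{(1 - r x)(x^2 - 1)}{x^2} \left( \frac{x - r}{x + \beta} - \frac{1 + \alpha x}{1 + \beta x} \right) \\
&= 0.
\end{aligned}$$

Hence, the numerator of (76) has a factor $(1 - x z)$, so that (76) is anticausal and the condition (65) is
satisfied. This establishes the optimality of $B(z)$ defined in (70) with $0 < x < 1$ and $y$ satisfying (71).
From Jensen's formula (9), we see that the corresponding feedback capacity is given by

$$C_{FB}(x) = \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_Y(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi} = -\frac{1}{2} \log x^2$$

under the power constraint

$$P(x) = \frac{y^2}{1 - x^2} = \frac{(1 - x^2)(1 + \alpha x)^2}{x^2 (1 + \beta x)^2}.$$

The case $\beta < \alpha$ can be treated similarly with $x < 0$, while the case $\beta = \alpha$ (i.e., $S_Z \equiv 1$) is trivial.

This completes the proof of Theorem VI.1. □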
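
The two sufficient conditions verified in this proof can also be checked numerically for a concrete parameter choice. The sketch below assumes $-1 < \alpha < \beta < 1$ (so $\sigma = 1$) and $x + \beta \ne 0$; it compares $\lambda$ with the minimum of the output spectrum on a grid, and estimates the Fourier coefficients of $\lambda/(1 + B(z^{-1})) - B(z) S_Z(z)$ at positive frequencies, which should vanish for an anticausal function. The function name, grid size, and number of coefficients are illustrative assumptions.

```python
import cmath
import math


def check_sufficient_conditions(alpha, beta, x, m=4096, kmax=8):
    """Return (lam, min_S_Y, worst_positive_coefficient) for the ARMA(1) filter.

    Assumes -1 < alpha < beta < 1, 0 < x < 1 and x + beta != 0.
    """
    y = (x * x - 1.0) / x * (1.0 + alpha * x) / (1.0 + beta * x)
    r = ((beta - alpha) * x * x - alpha * beta * x - beta) / (1.0 + beta * x)
    lam = ((1.0 - r * x) * (1.0 - r / x)
           / (x * x * (1.0 + beta * x) * (1.0 + beta / x)))

    def B(z):
        return (1.0 + beta * z) / (1.0 + alpha * z) * (y * z) / (1.0 - x * z)

    def S_Z(z):
        return abs(1.0 + alpha * z) ** 2 / abs(1.0 + beta * z) ** 2

    min_S_Y = float("inf")
    coeffs = [0j] * (kmax + 1)
    for t in range(m):
        z = cmath.exp(2j * math.pi * t / m)
        min_S_Y = min(min_S_Y, abs(1.0 + B(z)) ** 2 * S_Z(z))
        F = lam / (1.0 + B(1.0 / z)) - B(z) * S_Z(z)
        for k in range(1, kmax + 1):
            coeffs[k] += F * z ** (-k) / m  # k-th Fourier coefficient, k >= 1
    worst = max(abs(c) for c in coeffs[1:])
    return lam, min_S_Y, worst
```

For $\alpha = 0.2$, $\beta = 0.6$, $x = 0.5$ the sketch gives $\lambda \approx 3.164$, a minimum output spectrum of about $3.199$ at $\theta = 0$, and positive-frequency coefficients at the level of rounding error.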
Now we interpret in several ways the optimal feedback filter $B^\star(z)$ we found in (70). First, we show
that the celebrated Schalkwijk-Kailath signaling scheme is asymptotically equivalent to our feedback
filter $B^\star$, establishing the optimality of the Schalkwijk-Kailath coding scheme for the ARMA(1) noise
spectrum.

Consider the following coding scheme. Let $V \sim N(0, 1)$. Over the Gaussian channel $Y_i = X_i + Z_i$
with the noise spectral density

$$S_Z(e^{i\theta}) = \left| \frac{1 + \alpha e^{i\theta}}{1 + \beta e^{i\theta}} \right|^2,$$

the transmitter initially sends

$$X_1 = V \tag{77}$$

and subsequently sends

$$X_n = (\sigma x)^{-(n-1)} (V - \hat{V}_{n-1}), \qquad n = 2, 3, \ldots \tag{78}$$

where $\sigma = \mathrm{sgn}(\beta - \alpha)$, $x$ is the unique positive root of the fourth-order polynomial (69), and

$$\hat{V}_n = \hat{V}_n(Y^n) = E(V | Y_1, \ldots, Y_n)$$

is the minimum mean-squared error estimate of $V$ given the channel output signals $Y^n = (Y_1, \ldots, Y_n)$.
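For all $m < n$, we have

$$\begin{aligned}
X_n &= (\sigma x)^{-(n-1)} \left( V - E(V | Y^{m-1}) + E(V | Y^{m-1}) - E(V | Y^{n-1}) \right) \\
&= (\sigma x)^{-(n-m)} \left( X_m - E(X_m | Y^{n-1}) \right) \qquad (79) \\
&= (\sigma x)^{-(n-m)} \left( Y_m - Z_m - E(Y_m - Z_m | Y^{n-1}) \right) \\
&= -(\sigma x)^{-(n-m)} \left( Z_m - E(Z_m | Y^{n-1}) \right). \qquad (80)
\end{aligned}$$

Furthermore, since $Z_n = -\beta Z_{n-1} + U_n + \alpha U_{n-1}$ with white $\{U_i\}$, we can show that

$$Z_n \approx -(\beta - \alpha) \sum_{k=1}^{n-1} (-\alpha)^{k-1} Z_{n-k} + U_n$$

for large $n$, which, combined with (80), implies that

$$Z_n - E(Z_n | Y^{n-1}) \approx \left( \frac{\beta - \alpha}{\alpha + (\sigma x)^{-1}} \right) X_n + U_n \tag{81}$$

for large $n$. When $\alpha \ne \beta$, that is, when the noise spectrum is nonwhite, (81) is equivalent to

$$X_n \approx \frac{\alpha + (\sigma x)^{-1}}{\beta - \alpha} \left( E(Z_n | Z^{n-1}) - E(Z_n | Y^{n-1}) \right). \tag{82}$$
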

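Now from (79) with $m = n - 1$ and the orthogonality of $X_{n-1}$ and $Y^{n-2}$, we have
\begin{align*}
X_n &= (\sigma x)^{-1} \big( X_{n-1} - E(X_{n-1} \mid Y^{n-2}) - E(X_{n-1} \mid \tilde{Y}_{n-1}) \big) \\
&= (\sigma x)^{-1} \big( X_{n-1} - E(X_{n-1} \mid \tilde{Y}_{n-1}) \big) \tag{83}
\end{align*}
where $\tilde{Y}_{n-1} := Y_{n-1} - E(Y_{n-1} \mid Y^{n-2})$ is the innovation of the output process at time $n - 1$. Also from (81) and the orthogonality of $X_{n-1}$ and $U_{n-1}$, we have
\[ \tilde{Y}_{n-1} \approx c X_{n-1} + U_{n-1} \]
where
\[ c = 1 + \frac{\beta - \alpha}{\alpha + (\sigma x)^{-1}} = \frac{1 + \sigma\beta x}{1 + \sigma\alpha x}. \]
Finally, returning to (83), we can easily see that
\[ X_n \approx (\sigma x)^{-1} \left( X_{n-1} - \frac{cP}{c^2 P + 1} (c X_{n-1} + U_{n-1}) \right) = \sigma x X_{n-1} - y U_{n-1}, \]
where $x$ and $y$ are the constants given by (69) and (71). Therefore, the feedback coding scheme given by (77) and (78) is asymptotically equivalent to filtering the noise through the feedback filter
\[ B(z) = -\frac{1 + \beta z}{1 + \alpha z} \cdot \frac{y z}{1 - \sigma x z}, \]
which is exactly equal to the optimal feedback filter (70) we found in the proof of Theorem VI.1.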
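The asymptotic recursion above can be checked numerically. The following is a minimal sketch, not the paper's own procedure: it treats $x \in (0,1)$ as the free parameter, reads the power off the capacity polynomial in the form $P x^2 = (1 - x^2)/c^2$, and takes $y$ to be the coefficient implied by the MMSE step (an assumption of this sketch, since the defining equation (71) is not reproduced here). The function names are hypothetical.

```python
def sgn(t):
    """Sign function with sgn(0) = 0."""
    return (t > 0) - (t < 0)


def sk_filter_constants(alpha, beta, x):
    """Constants of the asymptotic recursion X_n = pole * X_{n-1} - y * U_{n-1}.

    x in (0, 1) is the free parameter; the power P is read off from the
    capacity polynomial written as P x^2 = (1 - x^2) / c^2 (assumed form).
    """
    sigma = sgn(beta - alpha)
    if sigma == 0:
        raise ValueError("alpha == beta is the white-noise case")
    c = (1 + sigma * beta * x) / (1 + sigma * alpha * x)
    P = (1 - x * x) / (x * x * c * c)
    gain = c * P / (c * c * P + 1)        # MMSE coefficient of X given c*X + U
    pole = (1 - gain * c) / (sigma * x)   # coefficient multiplying X_{n-1}
    y = gain / (sigma * x)                # coefficient multiplying -U_{n-1}
    return {"sigma": sigma, "c": c, "P": P, "pole": pole, "y": y}


def stationary_power(pole, y, steps=2000):
    """Iterate v <- pole^2 v + y^2, the variance recursion for unit-variance white U."""
    v = 0.0
    for _ in range(steps):
        v = pole * pole * v + y * y
    return v


consts = sk_filter_constants(alpha=0.5, beta=-0.3, x=0.6)
power = stationary_power(consts["pole"], consts["y"])
print(consts, power)
```

Under these assumptions the computed pole should coincide with $\sigma x$ and the stationary input power $y^2/(1 - x^2)$ with $P$, which is what the recursion $X_n \approx \sigma x X_{n-1} - y U_{n-1}$ requires.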
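For a more rigorous analysis, we can also show that
\[ \liminf_{n \to \infty} \frac{1}{n} I(V; \hat{V}_n) \ge \frac{1}{2} \log \frac{1}{x^2} \]
while
\[ \limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^n E X_i^2 \le P \]
under the coding scheme (78). Recall that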

= - K_i(r"-i)) + Zn 

and define 

K = {Yn + + + (ax)-("-2)v;_2), n > 2, 

and 

n 

Yn = Y.(-^r~'Yk, n > 1. 
fe=l 

Clearly, Y" can be represented as a linear combination of Yi, . . . , y„ and therefore, for any C2, C3, . . ., 

E{V - Vn? <e(v- c^y,"^ ^ . (84) 
Now we express

= {axy^''^^\l + aPx)V + Un + aUn-i 



and 



y:^ = dnV+Un + {-ar-'u' 

where U' = alio — pZo and 

/ n N 

= (l+apx) y(-a)"-'=(ax)-(^-i) 



1 + 



^) (l-(-aQxr)((7x)-("-^). 



1 + aax J 

By taking = in (I84t . we can easily verify that 

^(^-(EL2 4n"))' 1 



whence 



and 



lim sup - log E'(y - K)^ < log ( 

n— >oo \ X 



liminf^/(y;K) > ^ log 



On the other hand, 




which converges to 



^-2(n-l) + 2 

lim ^— = — ■ (x - 1) = P. 

The coding scheme described above uses the minimum mean-square error decoding of the message V, 
or equivalently, the joint typicality decoding of the Gaussian random codeword V, based on the general 
asymptotic equipartition property of Gaussian processes shown by Cover and Pombra [13, Theorem 2]. 
It is fairly straightforward to transform the Gaussian coding scheme to the original Schalkwijk-Kailath 
coding scheme. Here we sketch the standard procedure. A detailed analysis is given in Butman [7], [8]. 

Instead of the Gaussian codebook $V$, the transmitter initially sends a real number $\theta$ that is chosen from some equally spaced signal constellation $\Theta$, say,
\[ \Theta = \{-1, -1 + \delta, -1 + 2\delta, \ldots, 1 - 2\delta, 1 - \delta, 1\}, \qquad \delta = \frac{2}{2^{nR} - 1}, \]
and subsequently sends $\theta - \hat{\theta}_n$ (up to the same scaling as before) at time $n$, where $\hat{\theta}_n$ is the minimum variance unbiased linear estimate of $\theta$ given $Y^{n-1}$. Now we can verify that the optimal maximum-likelihood decoding is equivalent to finding $\theta^* \in \Theta$ that is closest to $\hat{\theta}_n$, which results in the error probability
\[ P_e^{(n)} \le \mathrm{erfc}\left( \sqrt{c_0 x_0^{-2n} / 2^{2nR}} \right) \]
where $x_0$ is the unique positive root of (69), $c_0$ is a constant independent of $n$, and
\[ \mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^\infty \exp(-t^2) \, dt \]
is the complementary error function. Now we can easily see that $P_e^{(n)}$ decays doubly exponentially fast as long as $R < -\log x_0 = C_{FB}$. Finally note that the doubly exponential decay of error probability can be raised to an arbitrary higher order by modifying the adaptive power allocation scheme by Pinsker [70], Kramer [48], and Zigangirov [109]. Also note that (79), (80), and (82) give interesting alternative interpretations of the Schalkwijk-Kailath coding scheme; the optimal transmitter refines the receiver's knowledge of any past input (79), or equivalently, any past noise (80). Also asymptotically, the optimal transmitter sends the difference between what he knows about the upcoming noise and what the receiver knows about it (82).

Before we move on to a more general class of noise spectra, we provide another angle on the optimal coding scheme by considering the following state-space model of the ARMA(1) noise process:
\begin{align*}
S_{n+1} &= -\beta S_n + U_n \\
Z_n &= (\alpha - \beta) S_n + U_n
\end{align*}
where $\{U_i\}_{i=0}^\infty$ are independent and identically distributed zero-mean unit-variance Gaussian random variables, and the state $S_n$ is independent of $U_n$ for each $n$. It is easy to check that this state-space model represents the noise spectrum
\[ S_Z(e^{i\theta}) = |H_Z(e^{i\theta})|^2 = \left| \frac{1 + \alpha e^{i\theta}}{1 + \beta e^{i\theta}} \right|^2. \]

For simplicity, we consider a slightly nonstationary noise model by assuming that $S_0 = U_0 = 0$. One can prove that this does not change the feedback capacity [45, Appendix], which implies that the Gaussian feedback channel $Y_i = X_i + Z_i$ with the noise spectrum $S_Z$ is asymptotically equivalent to the intersymbol interference channel
\[ Y'_k = \sum_{j=1}^{k} g_{k-j} X_j + U_k \]
where $\{g_k\}_{k=0}^\infty$ are the Fourier coefficients of the whitening filter
\[ \frac{1}{H_Z(e^{i\theta})} = \frac{1 + \beta e^{i\theta}}{1 + \alpha e^{i\theta}} \]
and $\{U_k\}_{k=1}^\infty$ is the white innovations process.
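To make the whitening step concrete, the following sketch computes the power-series coefficients of $(1 + \beta z)/(1 + \alpha z)$ and checks that filtering an ARMA(1) noise path started from rest returns the driving sequence. It is an illustration under the assumption that the whitening filter is exactly $1/H_Z$; the function names are hypothetical and not taken from the paper.

```python
def whitening_coeffs(alpha, beta, n):
    """First n power-series coefficients g_0..g_{n-1} of (1 + beta z)/(1 + alpha z)."""
    g = []
    for k in range(n):
        rhs = 1.0 if k == 0 else (beta if k == 1 else 0.0)
        prev = g[k - 1] if k >= 1 else 0.0
        g.append(rhs - alpha * prev)  # from (1 + alpha z) G(z) = 1 + beta z
    return g


def arma1_noise(alpha, beta, u):
    """Z_n = -beta Z_{n-1} + U_n + alpha U_{n-1}, started from rest."""
    z = []
    for n, un in enumerate(u):
        z_prev = z[n - 1] if n >= 1 else 0.0
        u_prev = u[n - 1] if n >= 1 else 0.0
        z.append(-beta * z_prev + un + alpha * u_prev)
    return z


def causal_filter(g, z):
    """Output w_n = sum_{j=0}^{n} g_j z_{n-j}."""
    return [sum(g[j] * z[n - j] for j in range(n + 1)) for n in range(len(z))]


alpha, beta = 0.5, -0.3
u = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 0.0, 1.25]
g = whitening_coeffs(alpha, beta, len(u))
z = arma1_noise(alpha, beta, u)
w = causal_filter(g, z)
print(g[:4], w)
```

The coefficients follow the closed form $g_0 = 1$, $g_k = (\beta - \alpha)(-\alpha)^{k-1}$ for $k \ge 1$, which is the same geometric weighting that appears in the approximation of $Z_n$ by its past.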

Consider the following coding scheme, which is "stationary" from time 2. At time 1, the transmitter sends $X_1 = V \sim N(0, \sigma_V^2)$ to learn $U_1 = Y_1 - X_1$ and subsequently sends
\[ X_n = \chi \big( S_n - E(S_n \mid Y_1^{n-1}) \big), \quad n = 2, 3, \ldots \tag{85} \]
where
\[ \chi = -\frac{1 + \sigma\alpha x}{\sigma x} \]
and $x$ is the unique positive parameter satisfying the capacity polynomial
\[ P x^2 = \frac{(1 - x^2)(1 + \sigma\alpha x)^2}{(1 + \sigma\beta x)^2}. \]
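As a numerical illustration, the sketch below solves the capacity polynomial by bisection on $(0, 1)$ and returns $-\log x_0$ in nats. It assumes the polynomial has the form $P x^2 (1 + \sigma\beta x)^2 = (1 - x^2)(1 + \sigma\alpha x)^2$ with $\sigma = \mathrm{sgn}(\beta - \alpha)$, $P > 0$, $|\alpha| \le 1$, and $|\beta| < 1$; the function name is hypothetical.

```python
import math


def sgn(t):
    """Sign function with sgn(0) = 0."""
    return (t > 0) - (t < 0)


def feedback_capacity_arma1(alpha, beta, P, iters=200):
    """Return (x0, -log x0) for the assumed capacity polynomial, by bisection on (0, 1)."""
    sigma = sgn(beta - alpha)

    def g(x):
        lhs = P * x * x * (1 + sigma * beta * x) ** 2
        rhs = (1 - x * x) * (1 + sigma * alpha * x) ** 2
        return lhs - rhs

    lo, hi = 0.0, 1.0  # g(0) = -1 < 0 and g(1) = P (1 + sigma*beta)^2 > 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    x0 = 0.5 * (lo + hi)
    return x0, -math.log(x0)


x_white, c_white = feedback_capacity_arma1(0.3, 0.3, 3.0)  # alpha == beta: white noise
x_ma, c_ma = feedback_capacity_arma1(0.5, 0.0, 3.0)        # MA(1) noise
print(x_white, c_white, x_ma, c_ma)
```

In the white case $\alpha = \beta$ the equation reduces to $P x^2 = 1 - x^2$, so the sketch should return $\frac{1}{2}\log(1 + P)$, the familiar capacity without memory.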

We can easily prove the optimality of this coding scheme from our previous analysis of the coding scheme (78). Indeed, it is straightforward to transform the refinement of the message $V$ in (78) to the refinement of the noise state $S_n$ in (85) and vice versa. However, the direct analysis has two important benefits. First, as we will see in the next section, the optimal feedback coding scheme for a general finite-order ARMA channel can be represented most naturally as the refinement of the current noise state. Second, we can interpret the role of the message-bearing signal $V$ as a perturbation to boost the output entropy rate; see Subsection III-C.

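For the analysis of the coding scheme (85), we introduce the notation
\[ \tilde{S}_n = S_n - E(S_n \mid Y_1^{n-1}) = S_n - \hat{S}_n \]
and similarly define $\hat{Y}_n = E(Y_n \mid Y_1^{n-1})$ and $\tilde{Y}_n = Y_n - \hat{Y}_n$. Under this notation, we can express the channel output as
\begin{align*}
Y_n &= \chi \tilde{S}_n + (\alpha - \beta) S_n + U_n \\
&= (\alpha - \beta + \chi) \tilde{S}_n + (\alpha - \beta) \hat{S}_n + U_n.
\end{align*}
Let $\sigma_n^2 = E\tilde{Y}_n^2$ and $s_n^2 = E\tilde{S}_n^2$. Then, we have
\begin{align*}
\hat{S}_{n+1} &= E(S_{n+1} \mid Y_1^n) \\
&= E(S_{n+1} \mid Y_1^{n-1}, \tilde{Y}_n) \\
&= E(S_{n+1} \mid Y_1^{n-1}) + E(S_{n+1} \mid \tilde{Y}_n) \\
&= E(-\beta S_n + U_n \mid Y_1^{n-1}) + E(-\beta S_n + U_n \mid \tilde{Y}_n) \\
&= -\beta \hat{S}_n + \gamma_n \tilde{Y}_n,
\end{align*}
where
\[ \gamma_n = \frac{1}{\sigma_n^2} \big( -\beta(\alpha - \beta + \chi) s_n^2 + 1 \big). \]
From this we get the state-space model for $\tilde{Y}_n$ as
\begin{align*}
\tilde{S}_{n+1} &= \big( -\beta - \gamma_n(\alpha - \beta + \chi) \big) \tilde{S}_n + (1 - \gamma_n) U_n \\
\tilde{Y}_n &= (\alpha - \beta + \chi) \tilde{S}_n + U_n,
\end{align*}
which implies the following recursive relationship for $\sigma_n^2$ and $s_n^2$ for $n \ge 2$:
\[ \sigma_n^2 = 1 + (\alpha - \beta + \chi)^2 s_n^2 \]
and
\begin{align*}
s_{n+1}^2 &= \big( \beta + \gamma_n(\alpha - \beta + \chi) \big)^2 s_n^2 + (1 - \gamma_n)^2 \\
&= \beta^2 s_n^2 + 1 - \frac{\big( -\beta(\alpha - \beta + \chi) s_n^2 + 1 \big)^2}{1 + (\alpha - \beta + \chi)^2 s_n^2}.
\end{align*}
It is easy to recall from Subsection III-C that the above recursion for $s_n^2$ is nothing but the one-dimensional discrete Riccati recursion.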

Suppose we have $V = 0$. Then $s_n^2 = 0$ for all $n$ and $\sigma_n^2 = 1$ for all $n$. In other words, the information rate $h(\mathcal{Y}) - h(\mathcal{Z}) = 0$; obviously, if we send nothing, the information rate should be zero.

Now take any $\epsilon > 0$. If $V \sim N(0, \epsilon)$, Lemma III.3(iv) shows that $s_n^2 \to s^2$, where $s^2$ is the positive solution to the one-dimensional Riccati equation
\[ s^2 = \beta^2 s^2 + 1 - \frac{\big( -\beta(\alpha - \beta + \chi) s^2 + 1 \big)^2}{1 + (\alpha - \beta + \chi)^2 s^2}, \]
so that $\sigma_n^2 \to 1 + (\alpha - \beta + \chi)^2 s^2$. With a little algebra, we can solve the Riccati equation to get
\[ s^2 = \frac{(\chi + \alpha)^2 - 1}{(\chi + \alpha - \beta)^2}, \]
which, combined with our choice of $\chi$, implies that $1 + (\alpha - \beta + \chi)^2 s^2 = x^{-2}$. On the other hand,
\[ E X_n^2 = \chi^2 s_n^2 \to \chi^2 s^2 = P. \]
Hence, the coding scheme given by (85) achieves the information rate $-\log x$ under the power constraint $P$, and hence is optimal.
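The convergence of the scalar Riccati recursion can be observed directly. The sketch below is illustrative only: it parameterizes by $x$ rather than $P$, and it assumes the choice $\chi = -(1 + \sigma\alpha x)/(\sigma x)$, which is the value for which the fixed point is consistent with the capacity polynomial $P x^2 = (1 - x^2)(1 + \sigma\alpha x)^2 / (1 + \sigma\beta x)^2$. All function names are hypothetical.

```python
def sgn(t):
    """Sign function with sgn(0) = 0."""
    return (t > 0) - (t < 0)


def riccati_limit(alpha, beta, chi, s2=1e-6, iters=2000):
    """Iterate the one-dimensional Riccati recursion from a small positive start."""
    a = alpha - beta + chi
    for _ in range(iters):
        s2 = beta * beta * s2 + 1.0 - (1.0 - beta * a * s2) ** 2 / (1.0 + a * a * s2)
    return s2


def scheme_summary(alpha, beta, x):
    """Limits of the error variance, input power, and innovation variance for a given x."""
    sigma = sgn(beta - alpha)
    if sigma == 0:
        raise ValueError("alpha == beta is the white-noise case")
    chi = -(1 + sigma * alpha * x) / (sigma * x)  # assumed choice of chi
    a = alpha - beta + chi
    s2 = riccati_limit(alpha, beta, chi)
    return {
        "chi": chi,
        "s2": s2,
        "s2_closed_form": ((chi + alpha) ** 2 - 1) / (chi + alpha - beta) ** 2,
        "power": chi * chi * s2,
        "innovation_variance": 1 + a * a * s2,
        "P_from_polynomial": (1 - x * x) * (1 + sigma * alpha * x) ** 2
        / (x * x * (1 + sigma * beta * x) ** 2),
    }


summary = scheme_summary(alpha=0.5, beta=-0.3, x=0.6)
print(summary)
```

Starting from a tiny positive variance mirrors the role of the small message power $\epsilon$: the zero solution is unstable, so any positive perturbation drives the recursion to the positive fixed point.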

The above analysis gives two complementary interpretations for the role of the signal V. Most naturally, 
we view the feedback capacity problem as that of maximizing the information rate and V obviously has 
the role of carrying the information we wish to transmit. On the other hand, if we view the feedback 
capacity problem as that of maximizing the output entropy rate, then V has the role of perturbing the 
(nonstationary) output process so that the resulting perturbed output process has the same entropy rate 
as its stationary version. This second interpretation leads to the following observation in the spectral 
domain. 

In the notation of Cover-Pombra's $n$-block capacity, let $B^*$ denote the "almost Toeplitz" feedback matrix corresponding to the optimal coding scheme and $K_{V,n}^*$ denote the message covariance matrix of rank one. If $\{\lambda_1, \ldots, \lambda_n\}$ denote the eigenvalues of $(I + B^*) K_{Z,n} (I + B^*)'$, then the asymptotic distribution of $\{\lambda_i\}_{i=1}^n$ follows the optimal output spectrum $S_Y^*$ in (72).

Now we argue that there must be one eigenvalue, say $\lambda_1$, that goes down to zero exponentially fast (as $n \to \infty$) and the rate of decay is in fact the feedback capacity. Why? The rank of $K_{V,n}^*$ is 1. Hence, roughly speaking, $K_{V,n}^*$ is water-filling the eigenmode corresponding to $\lambda_1$ with small power $\epsilon$. This results in



det(ii:^_„) = (Ai + e)Y[Xi = det{Kz, 



=2 



But we have $1 = \det(K_{Z,n}) = \det\big( (I + B^*) K_{Z,n} (I + B^*)' \big) = \prod_{i=1}^n \lambda_i$; thus $\lambda_1 = x^{2n}$. Therefore, we can view the role of the rank-one $K_{V,n}^*$ as the tiny drop of water that fills the modified terrain $(I + B^*) K_{Z,n} (I + B^*)'$ shaped by the optimal feedback filter $B^*$.
VII. General Finite-Order ARMA Noise Spectrum

We turn our focus to the general autoregressive moving average noise spectrum with finite order, say, $k$. We assume that the noise power spectral density $S_Z(e^{i\theta})$ has the canonical spectral factorization $S_Z(e^{i\theta}) = H_Z(e^{i\theta}) H_Z(e^{-i\theta})$ where
\[ H_Z(z) = \frac{P(z)}{Q(z)} = \frac{1 + \sum_{n=1}^k p_n z^n}{1 + \sum_{n=1}^k q_n z^n} \tag{86} \]
such that at least one of the monic co-prime polynomials $P(z)$ and $Q(z)$ has degree $k$ and all zeros of $P(z)$ and $Q(z)$ lie strictly outside the unit circle (i.e., both $P(z)$ and $Q(z)$ are stable). In particular, $S_Z(e^{i\theta})$ is bounded away from zero.

We first prove a proposition on the structure of the optimal output spectrum, which is reminiscent of 
Corollary IKOl 

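Proposition VII.1. Suppose that the ARMA($k$) noise process has the rational power spectral density $S_Z(e^{i\theta}) = H_Z(e^{i\theta}) H_Z(e^{-i\theta})$, where $H_Z(e^{i\theta})$ is given in (86). Then, the supremum in the variational characterization of the feedback capacity problem
\[ C_{FB} = \sup_{B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log |1 + B(e^{i\theta})|^2 \, \frac{d\theta}{2\pi} \]
is attained by a strictly causal $B^*(e^{i\theta}) \in H_\infty$ and the corresponding output spectrum $S_Y^*(e^{i\theta}) = |1 + B^*(e^{i\theta})|^2 S_Z(e^{i\theta})$ is of the form $S_Y^*(e^{i\theta}) = \sigma^2 H_Y(e^{i\theta}) H_Y(e^{-i\theta})$ with
\[ H_Y(z) = \frac{R(z)}{Q(z)} = \frac{1 + \sum_{n=1}^k r_n z^n}{1 + \sum_{n=1}^k q_n z^n}. \]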

Proof. This is a simple exercise from Proposition IV.1. Since $S_Z$ is bounded away from zero, the supremum is attained by a strictly causal $B^* \in H_2$. From the orthogonality condition (56),
\[ B^* (1 + \overline{B^*}) S_Z = S_Y^* - S_Z (1 + \overline{B^*}) =: A \]
is anticausal. Now consider
\[ S = |Q|^2 S_Y^* = |P|^2 (1 + \overline{B^*}) + |Q|^2 A. \]
Since $P$ and $Q$ are polynomials of degree at most $k$, and $(1 + \overline{B^*})$ and $A$ are anticausal, it is easy to see that $S(z)$ is of the form
\[ S(z) = s_k z^k + s_{k-1} z^{k-1} + \cdots, \]
that is,
\[ \int_{-\pi}^{\pi} S(e^{i\theta}) e^{-ij\theta} \, \frac{d\theta}{2\pi} = 0 \]
for $j \ge k + 1$. But from the symmetry $S(e^{i\theta}) = S(e^{-i\theta})$, this implies that $S(z)$ is of the form
\[ S(z) = \sum_{n=-k}^{k} s_n z^n, \]
or equivalently, $S(z)$ has the canonical factorization $S(z) = \sigma^2 R(z) R(z^{-1})$ for some polynomial $R$ of degree at most $k$. Since
\[ S_Y^* = \sigma^2 \frac{|R|^2}{|Q|^2} = |1 + B^*|^2 \frac{|P|^2}{|Q|^2}, \]
$1 + B^*$ must be of the form
\[ 1 + B^*(z) = b(z) \frac{R(z)}{P(z)} \in H_\infty \]
for some normalized Blaschke product $b(z)$ with $|b(e^{i\theta})|^2 = \sigma^2$. □



As was hinted at the end of the previous section, the state-space representation leads to a much 
richer development. Our result is, in some sense, expected from a motivating result by Yang, Kavcic, 
and Tatikonda [106], which shows that the feedback-dependent Markov source distribution achieves the 
maximum of finite-dimensional Marko-Massey directed mutual information [56] of a finite-state machine 
channel. However, the proof technique in [106] does not seem to be applicable to our situation, so we 
have to take a different approach. 

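We start by introducing the state-space model for the ARMA($k$) noise spectrum (86). Given stable monic polynomials $P(z)$ and $Q(z)$ with coefficients $\{p_n\}_{n=1}^k$ and $\{q_n\}_{n=1}^k$, respectively, as in (86), we construct real matrices $F$, $G$, and $H$ of sizes $k \times k$, $k \times 1$, and $1 \times k$ as
\[ F = \begin{bmatrix} -q_1 & -q_2 & \cdots & -q_{k-1} & -q_k \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \qquad G = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \]
\[ H = \begin{bmatrix} (p_1 - q_1) & \cdots & (p_k - q_k) \end{bmatrix}. \]
Let $\{U_n\}_{n=-\infty}^{\infty}$ be independent and identically distributed normal random variables with zero mean and unit variance. We introduce a state-space model of a linear system driven by $\{U_n\}$ as the input:
\begin{align*}
S_{n+1} &= F S_n + G U_n \\
Z_n &= H S_n + U_n \tag{87}
\end{align*}
where the state $S_n$ and the input $U_n$ are independent of each other. We can easily check that the output $\{Z_n\}_{n=-\infty}^{\infty}$ is a stationary Gaussian process with power spectral density $S_Z(e^{i\theta}) = |H_Z(e^{i\theta})|^2$, where
\begin{align*}
H_Z(z) &= \frac{P(z)}{Q(z)} \\
&= \frac{\det(I - z(F - GH))}{\det(I - zF)} \\
&= z H (I - zF)^{-1} G + 1.
\end{align*}
Under the above state-space representation, the channel output can be expressed as
\[ Y_n = X_n + Z_n = X_n + H S_n + U_n. \tag{88} \]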
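The state-space construction can be sanity-checked by comparing it against the ARMA difference equation on a fixed input. The sketch below assumes the companion form with $-q_1, \ldots, -q_k$ in the first row of $F$ and ones on the subdiagonal, and starts both systems from rest; the function names are hypothetical.

```python
def companion_model(p, q):
    """Build (F, G, H) for H_Z(z) = (1 + sum p_n z^n) / (1 + sum q_n z^n)."""
    k = len(q)
    assert len(p) == k
    F = [[0.0] * k for _ in range(k)]
    F[0] = [-qi for qi in q]          # first row carries the AR coefficients
    for i in range(1, k):
        F[i][i - 1] = 1.0             # shift register on the subdiagonal
    G = [1.0] + [0.0] * (k - 1)
    H = [pi - qi for pi, qi in zip(p, q)]
    return F, G, H


def simulate_state_space(F, G, H, u):
    """Z_n = H S_n + U_n, S_{n+1} = F S_n + G U_n, with S_0 = 0."""
    k = len(G)
    s = [0.0] * k
    z = []
    for un in u:
        z.append(sum(h * si for h, si in zip(H, s)) + un)
        s = [sum(F[i][j] * s[j] for j in range(k)) + G[i] * un for i in range(k)]
    return z


def simulate_arma(p, q, u):
    """Z_n = -sum q_j Z_{n-j} + U_n + sum p_j U_{n-j}, started from rest."""
    z = []
    for n, un in enumerate(u):
        acc = un
        for j in range(1, len(p) + 1):
            if n - j >= 0:
                acc += p[j - 1] * u[n - j] - q[j - 1] * z[n - j]
        z.append(acc)
    return z


p, q = [0.5, 0.2], [-0.3, 0.1]
u = [1.0, -2.0, 0.5, 3.0, -1.5, 0.25, 0.0, 1.25, -0.75, 2.0]
F, G, H = companion_model(p, q)
z_ss = simulate_state_space(F, G, H, u)
z_arma = simulate_arma(p, q, u)
print(z_ss, z_arma)
```

Agreement of the two output sequences on an arbitrary input is equivalent to the identity $z H (I - zF)^{-1} G + 1 = P(z)/Q(z)$ at the level of impulse responses.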



We state our main result in this section. 
Theorem VII.1. Suppose the stationary Gaussian noise process $\{Z_i\}_{i=1}^{\infty}$ has the state-space representation (87). Then, the feedback capacity $C_{FB}$ of the Gaussian channel $Y_i = X_i + Z_i$, $i = 1, 2, \ldots$, under the power constraint $P$, is given by
\[ C_{FB} = \max_X \frac{1}{2} \log\big( 1 + (X + H) \Sigma_+(X) (X + H)' \big) \tag{89} \]
where the maximum is taken over all $X \in \mathbb{R}^{1 \times k}$ such that $F - G(X + H)$ has no unit-circle eigenvalue and $X \Sigma_+(X) X' \le P$, with $\Sigma_+(X)$ being the maximal solution to the discrete algebraic Riccati equation
\[ \Sigma = F \Sigma F' + G G' - \frac{(F \Sigma (X + H)' + G)(F \Sigma (X + H)' + G)'}{1 + (X + H) \Sigma (X + H)'}. \tag{90} \]

We prove Theorem VII.1 in two steps. The first step is the following structural result.

Lemma VII.1. Suppose the ARMA($k$) noise process has the state-space representation (87). Then the feedback capacity is achieved by the input process $\{X_n\}_{n=-\infty}^{\infty}$ of the form
\[ X_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big) \]
for some $X \in \mathbb{R}^{1 \times k}$ such that $F - G(X + H)$ has no unit-circle eigenvalue.

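Proof. Suppose that $B(e^{i\theta}) = \sum_{j=1}^{\infty} b_j e^{ij\theta}$ achieves the maximum of the variational problem in (55), or equivalently, the stationary process $\{X_i\}_{i=-\infty}^{\infty}$ defined by $X_n = \sum_{j=1}^{\infty} b_j Z_{n-j}$ achieves the feedback capacity. If we regard $X_n$ as a vector in the Hilbert space generated by linear spans of $\{Z_i\}_{i=-\infty}^{\infty}$, $X_n$ lies in the closed linear span of all past $Z_j$'s, that is, $X_n \in \mathrm{clin}\{Z_{-\infty}^{n-1}\}$. Equivalently,
\[ X_n \in \mathcal{H}_n := \mathrm{clin}\{S^n, Y_{-\infty}^{n-1}\}. \]
We decompose $X_n$ into two orthogonal parts as
\[ X_n = \xi_n + \zeta_n \]
where $\xi_n$ lies in the closed linear span $\mathcal{G}_n$ of $S_n$ and $Y_{-\infty}^{n-1}$, and $\zeta_n$ lies in the orthogonal complement of $\mathcal{G}_n$ in $\mathcal{H}_n$, namely,
\begin{align*}
\xi_n &\in \mathcal{G}_n := \mathrm{clin}\{S_n, Y_{-\infty}^{n-1}\} \\
\zeta_n &\in (\mathcal{H}_n \ominus \mathcal{G}_n).
\end{align*}
Since $\{X_n\}$ achieves the feedback capacity, from the orthogonality condition (56) in Proposition IV.1,
\[ \xi_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big) \]
for some $X \in \mathbb{R}^{1 \times k}$. In other words, for each orthogonal feedback filter $B(z)$, we have a representation
\[ X_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big) + \zeta_n \tag{91} \]
for some $X \in \mathbb{R}^{1 \times k}$.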

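To ease the notation a little, we shall subsequently write
\begin{align*}
\hat{A}_n &:= E(A_n \mid Y_{-\infty}^{n-1}) \\
\tilde{A}_n &:= A_n - \hat{A}_n
\end{align*}
for a generic random variable (or a random vector) $A_n$. Under this notation, we have
\[ Y_n = X \tilde{S}_n + H S_n + \zeta_n + U_n \tag{92} \]
so that
\[ \tilde{Y}_n = (X + H) \tilde{S}_n + \zeta_n + U_n. \]
Let $P_\zeta = E\zeta_n^2$, $\sigma^2 = E\tilde{Y}_n^2$, and $\Sigma := \mathrm{cov}(\tilde{S}_n)$. Then, from the mutual orthogonality of $\zeta_n$, $U_n$, and $\tilde{S}_n$,
\[ \sigma^2 = (X + H) \Sigma (X + H)' + P_\zeta + 1. \]
On the other hand, it is easy to check that
\begin{align*}
\hat{S}_{n+1} &= E(F S_n + G U_n \mid Y_{-\infty}^{n}) \\
&= E(F S_n + G U_n \mid Y_{-\infty}^{n-1}) + E(F S_n + G U_n \mid \tilde{Y}_n) \\
&= F \hat{S}_n + \Gamma \tilde{Y}_n, \tag{93}
\end{align*}
where
\[ \Gamma := \frac{1}{\sigma^2} \big( F \Sigma (X + H)' + G \big). \]
Thus, we have the state-space representation of $\tilde{Y}_n$ as
\begin{align*}
\tilde{S}_{n+1} &= (F - \Gamma(X + H)) \tilde{S}_n - \Gamma \zeta_n + (G - \Gamma) U_n \\
\tilde{Y}_n &= (X + H) \tilde{S}_n + \zeta_n + U_n, \tag{94}
\end{align*}
which implies that $\Sigma$ satisfies the following discrete algebraic Riccati equation (DARE):
\begin{align*}
\Sigma &= (F - \Gamma(X + H)) \Sigma (F - \Gamma(X + H))' + (G - \Gamma)(G - \Gamma)' + P_\zeta \Gamma \Gamma' \\
&= F \Sigma F' + G G' - (F \Sigma (X + H)' + G)\big( 1 + P_\zeta + (X + H) \Sigma (X + H)' \big)^{-1} (F \Sigma (X + H)' + G)'. \tag{95}
\end{align*}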

We now ask whether there exists a positive semidefinite solution $\Sigma_+$ to the DARE (95) that stabilizes the matrix
\[ F - \Gamma(X + H) = F - (F \Sigma_+ (X + H)' + G)\big( 1 + P_\zeta + (X + H) \Sigma_+ (X + H)' \big)^{-1} (X + H), \]
that is, all eigenvalues of $F - \Gamma(X + H)$ lie inside the unit circle. Obviously this condition is necessary to make the state-space equations (88) and (94) have any meaning for the stationary output process and its innovations.

We note that the stability of $F$ clearly implies the detectability of $\{F, X + H\}$ (i.e., there exists a matrix $K$ such that $F - K(X + H)$ is stable). In turn, for $P_\zeta > 0$, the stability of $F - GH$ implies the unit-circle controllability of $\{F_\zeta, G_\zeta\}$, where

G{X + H) 



Fq = F 



l + Pc 



C 



This condition of the unit-circle controllability (or controllability on the unit circle) means that there exists a matrix $K$ such that $F_\zeta - G_\zeta K$ has no eigenvalues on the unit circle. When $P_\zeta = 0$, the unit-circle controllability of $\{F_\zeta, G_\zeta\}$ is equivalent to the condition that $F - G(X + H)$ has no unit-circle eigenvalue.

Turning back to the above question of the existence of the stabilizing solution to (95), we see from a standard result on DARE [42, Theorem E.5.1] that the detectability of $\{F, X + H\}$ together with the unit-circle controllability of $\{F_\zeta, G_\zeta\}$ is equivalent to the existence of a stabilizing solution $\Sigma_+$. Moreover, this stabilizing solution is unique and positive semidefinite. Therefore, the input process described by (91) and the corresponding output process (92) are well-defined and uniquely determined by $(X, P_\zeta)$.

Now we prove that $P_\zeta$ is necessarily zero. We first observe that the derivation of the state-space equation (94) depends on the fact that $\zeta_n \in \mathcal{H}_n = \mathrm{clin}\{S^n, Y_{-\infty}^{n-1}\}$ only via the orthogonality of $\zeta_n$ and $(S_n, Y_{-\infty}^{n-1})$. Therefore, if the input process
\[ X_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big) + \zeta_n \]
achieves the feedback capacity, inducing the output distribution uniquely defined by (88)-(95), any other input process of the form
\[ X_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big) + W_n \]
achieves the feedback capacity with the same output distribution, provided that $E W_n^2 = P_\zeta$ and $W_n$ is orthogonal to $(S_n, Y_{-\infty}^{n-1}, U_n)$ (see the footnote). In particular, we can take $W_n = V_n$, where $\{V_n\}_{n=-\infty}^{\infty}$ is a white Gaussian process with power spectral density $S_V(e^{i\theta}) = P_\zeta$, independent of $\{Z_n\}_{n=-\infty}^{\infty}$.

But as Remark IV.1 shows, a nonzero white $S_V$ achieves the feedback capacity only if the noise spectrum itself is white. Since $S_Z$ is nonwhite, $P_\zeta$ must be zero. Therefore, the optimal input process must be of the form
\[ X_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big) \]
for some $X$ such that $F - G(X + H)$ has no unit-circle eigenvalue. □

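Equipped with Lemma VII.1, the proof of Theorem VII.1 is straightforward.

Proof of Theorem VII.1. We know that the capacity-achieving input process is of the form
\[ X_n = X \big( S_n - E(S_n \mid Y_{-\infty}^{n-1}) \big). \]
From (94), the state-space equation for $\tilde{Y}_n$ becomes
\begin{align*}
\tilde{S}_{n+1} &= (F - \Gamma(X + H)) \tilde{S}_n + (G - \Gamma) U_n \\
\tilde{Y}_n &= (X + H) \tilde{S}_n + U_n, \tag{96}
\end{align*}
where
\[ \Gamma = \Gamma(X) = (F \Sigma_+ (X + H)' + G)\big( 1 + (X + H) \Sigma_+ (X + H)' \big)^{-1} \]
and $\Sigma_+ = \Sigma_+(X)$ is the unique positive semidefinite stabilizing solution to the DARE
\[ \Sigma = F \Sigma F' + G G' - (F \Sigma (X + H)' + G)\big( 1 + (X + H) \Sigma (X + H)' \big)^{-1} (F \Sigma (X + H)' + G)'. \]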

^Although E{Sr,\Y"^) is symbolically the same for any choice of Wn, each could result in different output processes 
defined recursively by ~ X{Sn — E{Sn\Y"^^)) + Wn + Un- However, our analysis of the Riccati equation shows that 
the output process is uniquely defined for any choice of W„. 
Since $\tilde{Y}_n$ is a white process with variance
\begin{align*}
\sigma^2 &= 1 + (X + H) \Sigma_+(X) (X + H)' \\
&= \frac{\det(F - G(X + H))}{\det(F - \Gamma(X + H))}
\end{align*}
and $h(\mathcal{Y}) = h(\tilde{\mathcal{Y}})$, the corresponding information rate is
\[ \frac{1}{2} \log\big( 1 + (X + H) \Sigma_+(X) (X + H)' \big) \]
under the power consumption $X \Sigma_+(X) X'$. Clearly, the feedback capacity $C_{FB}(P)$ is the maximal information rate over all $X$'s satisfying the power constraint $X \Sigma_+(X) X' \le P$. □
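For a fixed direction $X$, the rate and power in the theorem can be evaluated numerically. The sketch below is only an illustration under stated assumptions: it iterates the Riccati recursion from the identity matrix and assumes that this iteration converges to the stabilizing solution $\Sigma_+(X)$, which it does not verify. It uses plain Python lists, and the function names are hypothetical.

```python
import math


def dare_by_iteration(F, G, H, X, iters=500):
    """Iterate Sigma <- F Sigma F' + G G' - K K' / (1 + C Sigma C'), with C = X + H
    and K = F Sigma C' + G, starting from the identity (assumed to converge)."""
    k = len(G)
    C = [x + h for x, h in zip(X, H)]
    S = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for _ in range(iters):
        SC = [sum(S[i][j] * C[j] for j in range(k)) for i in range(k)]        # Sigma C'
        var = 1.0 + sum(C[i] * SC[i] for i in range(k))                       # 1 + C Sigma C'
        K = [sum(F[i][j] * SC[j] for j in range(k)) + G[i] for i in range(k)]  # F Sigma C' + G
        FS = [[sum(F[i][m] * S[m][j] for m in range(k)) for j in range(k)] for i in range(k)]
        FSF = [[sum(FS[i][m] * F[j][m] for m in range(k)) for j in range(k)] for i in range(k)]
        S = [[FSF[i][j] + G[i] * G[j] - K[i] * K[j] / var for j in range(k)] for i in range(k)]
    return S


def rate_and_power(F, G, H, X):
    """Information rate (nats) and input power for the direction X."""
    k = len(G)
    S = dare_by_iteration(F, G, H, X)
    C = [x + h for x, h in zip(X, H)]
    csc = sum(C[i] * S[i][j] * C[j] for i in range(k) for j in range(k))
    xsx = sum(X[i] * S[i][j] * X[j] for i in range(k) for j in range(k))
    return 0.5 * math.log(1.0 + csc), xsx


# ARMA(1) check with alpha = 0.5, beta = -0.3, x = 0.6 (so sigma = -1):
x = 0.6
chi = (1 - 0.5 * x) / x  # equals -(1 + sigma*alpha*x)/(sigma*x) for sigma = -1
rate, power = rate_and_power([[0.3]], [1.0], [0.8], [chi])
print(rate, power)
```

For $k = 1$ this reduces to the scalar recursion of the previous section, so the rate should come out as $-\log x$ and the power as the value given by the capacity polynomial. Searching over $X$ under the power constraint is a separate, nonconvex problem that this sketch does not attempt.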

The proofs of Lemma VII.1 and Theorem VII.1 reveal the structure of the optimal output spectrum once again (cf. Proposition VII.1). Indeed, we have
\[ Y_n = H \hat{S}_n + \tilde{Y}_n, \]
which, combined with (93), implies
\[ S_Y(z) = \sigma^2 \, \frac{\det(I - z(F - \Gamma H)) \det(I - z^{-1}(F - \Gamma H))}{\det(I - zF) \det(I - z^{-1}F)}, \tag{97} \]
which is bounded away from zero [42, Lemma 8.3.1]. Furthermore, since the optimal input can be expressed as $X_n = X \tilde{S}_n$, we can easily check from (96) that the corresponding feedback filter is given as
\begin{align*}
B(z) &= z X \big( I - z(F - \Gamma(X + H)) \big)^{-1} (G - \Gamma) \, \frac{\det(I - zF)}{\det(I - z(F - GH))} \\
&= \frac{\det(I - z(F - G(X + H))) \det(I - z(F - \Gamma H))}{\det(I - z(F - \Gamma(X + H))) \det(I - z(F - GH))} - 1. \tag{98}
\end{align*}
From Lemma III.3(iv) it is easy to see that
\[ \frac{\det(I - z(F - G(X + H)))}{\det(I - z(F - \Gamma(X + H)))} \]
is a normalized Blaschke product whose zeros determine the entropy rate of the output process.

Now we can easily relate Theorem VII.1 to the Schalkwijk-Kailath coding scheme. Since we already went through detailed discussions of the Schalkwijk-Kailath coding for the first-order ARMA spectrum in the previous section, we give here a rather sketchy argument. For simplicity, assume the state-space representation (87) of the noise process $\{Z_i\}_{i=1}^{\infty}$ with $S_0 = 0$ and $U_0 = 0$. For the initial $k$ transmissions, the transmitter sends $X_n = V_n$, $n = 1, \ldots, k$, with $V^k \sim N_k(0, K_V)$ and subsequently,
\[ X_n = X \big( S_n - E(S_n \mid Y^{n-1}) \big), \quad n = k + 1, k + 2, \ldots \tag{99} \]
where $X \in \mathbb{R}^{1 \times k}$ achieves the maximum in (89). In other words, after the initial $k$ transmissions, the transmitter refines the receiver's error of the current noise state. Since the error is $k$-dimensional, one must project it down in the direction $X$.

Lemma III.3 shows that, as long as $K_V$ is positive definite, or equivalently, as long as $\mathrm{cov}(S_{k+1} \mid Y^k)$ is positive definite, $\mathrm{cov}(S_n \mid Y^{n-1})$ converges to the unique stabilizing solution $\Sigma_+(X)$ of the DARE (90) and thus $h(Y_n \mid Y_1^{n-1})$ converges to $C_{FB}$. It is also straightforward to rewrite the coding scheme (99) as the successive refinement of the message-bearing signal $(V_1, \ldots, V_k)$, from which we can generalize the original one-dimensional Schalkwijk-Kailath coding scheme into the $k$-dimensional one with $(\theta_1, \ldots, \theta_k)$ in some equally spaced constellation. (Instead of using the minimum mean square error estimate of $V^k$, we use the minimum variance unbiased estimate of $\theta^k$; both estimates are linearly related [42, Section 3.4] [55, Section 4.5].) As before, we can also interpret the role of $K_V$ as tiny drops of water that fill the noise terrain modified by the optimal feedback filter $B^*(z)$.

Finally, we give a more explicit characterization of the optimal direction X. 

Proposition VII.2. Suppose $X \in \mathbb{R}^{1 \times k}$ satisfies the following conditions.

(100) Power: $X \Sigma_+(X) X' = P$, where $\Sigma_+$ is the unique stabilizing solution to the DARE (90).

(101) Eigenvalues: $F - G(X + H)$ has distinct eigenvalues $a_1, \ldots, a_k$ outside the unit circle. In particular, $\Sigma_+ \succ 0$.

(102) Spectrum: The corresponding output spectrum $S_Y(z)$ in (97) is such that
\[ 0 < S_Y(a_1) = S_Y(a_2) = \cdots = S_Y(a_k) \le \min_{|z| = 1} S_Y(z). \]

Then, $X$ achieves the maximum in (89).

Proof. We show that the conditions (100)-(102) imply the conditions (63)-(65) in Proposition IV.3, which in turn implies that the corresponding feedback filter achieves the feedback capacity. The power condition (63) is satisfied by (100). For the other two conditions, take $\lambda = S_Y(a_1) \le \min_{|z|=1} S_Y(z)$, which satisfies (64) automatically. We use the notation (see (98)):



A{z) 


= det(/ - 


z{F- 


G{X + H))) = (1 - al^z) •••(!- a-^z 


A*{z) 


= det(/ - 


z{F- 


T{X + H))) = [l - aiz) ■ ■ ■ {I - akz) 


P{z) 


= det(/ - 


z{F- 


GH)) 


Qiz) 


= det(/ - 


zF) 





and 



R{z) = det{I - z{F -TH)). 

Note that 



A{z) A{z-') 



A#{z) A*{z~^] 

Under this notation, the noise spectrum Sz{z), the feedback filter B{z), and the corresponding output 
spectrum Sy{z) can be written as 

P{z) P{z~^) 



Sz{z) 



Q{z) Q{z' 
and

Sy{z) = a 



Q{z)Q{z-^y 
We now consider
\[ f(z) = A(z)\big( \lambda Q(z) Q(z^{-1}) - \sigma^2 R(z) R(z^{-1}) \big) + \sigma^2 A^{\#}(z) P(z) R(z^{-1}). \]
We will show that $f(z)/(A^{\#}(z) Q(z))$ is anticausal by showing that $f(z)$ has factors $Q(z)$ and $A^{\#}(z)$. Indeed, since $X_n \perp Y_{-\infty}^{n-1}$, we have the anticausality of $B(z) S_Z(z) (1 + B(z^{-1}))$, or equivalently,

A{z) Rjz) \ Pjz) Ajz-') _ ,/R{z) A*iz)Piz) \ 
A#{z)P{z) )q{z)A#{z-^) \Q{z) A{z)Q{z)J 

is anticausal, which implies that $A(z) R(z) - A^{\#}(z) P(z)$ and thus $f(z)$ have a factor $Q(z)$. On the other hand, $\lambda Q(z) Q(z^{-1}) - \sigma^2 R(z) R(z^{-1}) = 0$ for each $a_i$, $i = 1, \ldots, k$, which implies that $f(z)$ has a factor $A^{\#}(z)$.

Finally, the anticausality of $f(z)/(A^{\#}(z) Q(z))$ implies the anticausality of

A*{z-^) P[z~^) ( A{z) R{z) \ P{z) P{z^^ 



A{z~^) R[z-^) \A*{z)P{z) jQ{z)Q{z-^] 



Thus, the feedback filter $B(z)$ in (103) satisfies the causality condition (65) and achieves the feedback capacity. In particular, $X$ that satisfies the conditions set forth in (100)-(102) maximizes (89). □



VIII. Concluding Remarks 

We have given an attempt to solve the Gaussian feedback capacity problem in a closed form. A 
variational characterization of the feedback capacity was found (Theorems II V. 1 1 and IV. U : 

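\[ C_{FB} = \sup_{S_V(e^{i\theta}),\, B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})} \, \frac{d\theta}{2\pi} = \sup_{B(e^{i\theta})} \int_{-\pi}^{\pi} \frac{1}{2} \log |1 + B(e^{i\theta})|^2 \, \frac{d\theta}{2\pi}, \]

which was subsequently simplified into a more explicit form when the Gaussian noise process has a finite-order autoregressive moving average noise spectrum (Theorem VII.1):

\[ C_{FB} = \max_X \frac{1}{2} \log\big( 1 + (X + H) \Sigma (X + H)' \big), \]
\[ \Sigma = F \Sigma F' + G G' - \frac{(F \Sigma (X + H)' + G)(F \Sigma (X + H)' + G)'}{1 + (X + H) \Sigma (X + H)'}, \]

and was solved completely in a closed form when the noise spectrum is the first-order autoregressive moving average (Theorem VI.1):

\[ C_{FB} = -\log x, \qquad P x^2 = \frac{(1 - x^2)(1 + \sigma\alpha x)^2}{(1 + \sigma\beta x)^2}. \]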

The optimal coding scheme was interpreted as a natural extension of the Schalkwijk-Kailath linear signaling scheme:
\[ X_n = X \big( S_n - E(S_n \mid Y^{n-1}) \big), \]
which strongly confirms the common belief that the stationary Wiener/Kalman filter is the optimal feedback processor.

In some sense, our development can be viewed as an asymptotic analysis of the sequence of convex 
optimization problems 

CfB = lim max -log ■ . d4b 

Thus, it is refreshing to note that the pivotal proof ingredients, not to mention the origination of the problem and the interpretations of the solution, have information theoretic flavors. Indeed, the proof of Theorem IV.1 relies heavily on the maximum entropy argument, while the proof of Theorem VII.1 uses Shannon's water-filling solution for the Gaussian nonfeedback capacity problem to reach a certain contradiction.

Even in its current intermediate form, the solution to the Gaussian feedback capacity problem reveals a 
rich connection between control, estimation, and communication; roughly speaking, the communication 
problem over the Gaussian feedback channel is equivalent to a stochastic control problem of the receiver's 
estimation error, which is, in turn, equivalent to the maximum entropy problem of the output spectrum. 
We conclude by posing a few remaining questions that will invite further investigations to illuminate a 
complete picture of this fascinating interplay between control, estimation, and communication. 

1) From Theorem IV.1 and the Szegő-Kolmogorov-Krein theorem, we get the following max-min characterization of the feedback capacity:



Cfb = sup inf - log 



{bk} {flfc} ^ ^ 



1 -E^ke'^'iy -^b.e^'^'^fszie^')^] (104) 

k ^ / 



where the infimum is taken over all finite sequences $\{a_k\}$ and the supremum is over all finite sequences $\{b_k\}$ satisfying

•^"'^ k ^ 
Thus, the feedback capacity problem can be viewed as a game between the controller (feedback filter) $B(z) = \sum_k b_k z^k$ and the estimator $A(z) = \sum_k a_k z^k$. Does this game have a saddle point? If so, can we get an explicit characterization of the saddle point and the associated value of the game? The objective of the optimization problem (104) is not quasi-convex-concave in $(A, B)$ and the usual Fan-Sion minimax theorems [23] [83] do not apply. Nonetheless, the problem is quadratic, so a careful application of the S-procedure (see Yakubovich [100]) might lead to an interesting answer.
2) We wish to claim that Proposition IV.3 gives a "characterization" of the optimal feedback filter $B(z)$. Unfortunately, there are two important links missing to fully justify this claim. First, we should prove the existence (see the footnote) of a filter $B$ that satisfies the conditions (63)-(65). The similarity of the finite-dimensional dual optimization problem (45) and its infinite-dimensional version (62) suggests that strong duality may continue to hold for the infinite-dimensional problem (53). It should be noted, however, that because the sequence $\{C_{FB,n}\}$ of the finite-dimensional feedback capacities is superadditive, the proof technique for the convergence of the finite-dimensional primal optimization problem (40) to its infinite-dimensional version (53) is not directly applicable in the dual case. More refined tools from convex analysis in topological vector spaces (see Ekeland and Temam [20] and Young [107]) might be useful in proving the strong duality directly, but little progress has been made in this direction.

Even with the existence proof, however, the conditions (63)-(65) are still very complicated, so their utility looks somewhat limited. So the natural question is: can we characterize the optimal feedback filter $B(z)$ in a more explicit manner, at least at the conceptual level as in the Wiener-Hopf factorization?

3) One way of remedying the problem mentioned above is restricting attention to a limited class of noise spectra. This is what we did in Sections VI and VII with moderate success. However, our characterization of the ARMA($k$) feedback capacity in Theorem VII.1, while conceptually appealing in the context of the Schalkwijk-Kailath coding scheme, falls short of a numerically tractable solution. Although the algebraic Riccati equation for a fixed projection direction $X$ can be solved efficiently, for example, by the invariant subspace method, it seems that finding the optimal direction $X^*$ under the given power constraint $P$ is a difficult nonconvex optimization problem.

In this regard, the sufficient condition for the optimal direction $X^*$ in Proposition VII.2 has a rather interesting implication: it indirectly characterizes the solution of the nonconvex optimization problem (89), which is difficult to solve even numerically. Can we give a more explicit characterization of the optimal direction $X^*$ that satisfies the conditions (100)-(102)?

4) There are two more possible connections to optimal control theory. First, the last condition (102) is reminiscent of the classical interpolation problem studied by Pick and Nevanlinna (see, for example, Ball, Gohberg, and Rodman [3]). Second, our variational characterization of the feedback capacity problem may have some relevance to the risk-sensitive or minimum-entropy control/estimation problem (see Whittle [96] and Mustafa and Glover [58]) as was pointed out by Babak Hassibi, Stephen Boyd, and Sanjoy Mitter in private communications. Indeed, the dual (62) to the feedback
capacity problem has the leading entropy term log (;/ — -01(6*^)) |^ that looks similar to the one 
in the minimum-entropy control problem. Can these connections be made more clear and precise? 

5) Finally, Sergio Verdú posed the following question in the context of (the growth of) spectral efficiency in the wideband regime [95]: As a function of the power constraint, is $C''_{FB}(0)$ strictly larger than $C''(0)$? In contrast to Dembo's result [15] on the first derivative $C'_{FB}(0) = C'(0)$, we can show that $C''_{FB}(0) > C''(0)$ and even surprisingly that $C''_{FB}(0) - C''(0) = \infty$ for rational noise spectra. However, in order to answer the real question whether feedback increases the spectral efficiency in



Footnote: We do know that, if the noise spectrum is bounded away from zero, there exists an optimal filter $B$ that achieves the feedback capacity; see Proposition IV.1. The question here is, roughly speaking, whether the sufficient condition in Proposition IV.3 is necessary as well.

the wideband regime, we need to better understand the physics of the Gaussian feedback channel. Maybe it is too late to ask this question, but where does the discrete-time Gaussian feedback channel come from?

Usually the physical model for the discrete-time Gaussian nonfeedback channel comes from the 
corresponding continuous-time Gaussian nonfeedback channel; see Gallager [24, Chapter 8] and 
Wyner [98] for details on slightly different alternatives. For two reasons, unfortunately, the usual 
method fails to yield our feedback channel model. First, the usual approach is based on the 
Karhunen-Loeve expansion of the continuous-time waveform filtered noise process $Z(t)$, which
gives a parallel Gaussian channel in which the noise of the nth orthogonal channel has variance 
that corresponds to the eigenvalue of the Karhunen-Loeve expansion. On the other hand, our 
discrete-time channel model is defined through a temporally correlated stationary Gaussian noise 
process. One may argue that this discrepancy is of minor importance in the nonfeedback case, since 
the capacity is determined solely from the eigenvalues of the noise spectrum, not the eigenvectors. 

The second reason, however, is much more fundamental to the nature of causal feedback. Indeed, 
the "time" indices for the components of the parallel channel coming from the Karhunen-Loeve 
expansion do not correspond to the physical time, and hence they have no causal relationship among 
them. How can we causally code over components of the Karhunen-Loeve expansion? 

Schalkwijk's original paper [77] considers the following "physical" model for the discrete-time 
white Gaussian noise channel: The transmissions take place at integer time values, with the unit of time
being 1/2W. Numbers are sent by amplitude modulation of "some basic waveform" of bandwidth 
W. The disturbance is white Gaussian noise and the received output comes from a matched filter. 

But it is the very paradox of time-limited and band-limited signals that leads to the rigorous 
treatments by Wyner and Gallager, based on the Karhunen-Loeve expansion! For the particular 
case of the strictly band-limited channel, prolate spheroidal functions (see Slepian, Landau, and 
Pollak [84], [53], [54]) form a basis for the Karhunen-Loeve expansion, which destroys the time
causality of feedback. 

If we allow the amplitude-modulating waveform $s(t)$ to span arbitrary bandwidth, in particular,
if $s(t)$ is a rectangular pulse, then the resulting discrete-time channel is the usual additive white
Gaussian noise channel, where the noise process $\{Z_n\}$ at the matched filter output is given by

$$Z_n = \int_{n-1}^{n} dW(t) = W(n) - W(n-1)$$



and $W(t)$ is the standard Brownian motion. Of course, we have lost the tight connection to the
continuous-time band-limited Gaussian channel (and thus we can no longer talk about the feedback 
capacity of the continuous-time band-limited channel), yet we have a physically plausible model 
for the discrete-time Gaussian feedback channel. 

In the same vein, we can model the discrete-time first-order autoregressive Gaussian noise channel
from an appropriate continuous-time channel, via a slightly different path. Consider the stationary
Ornstein-Uhlenbeck process $Z(t)$ in Itô representation (see, for example, Karatzas and Shreve [43]):

$$dZ(t) = -a Z(t)\,dt + dW(t)$$

for some $a > 0$. Equivalently,

$$Z(t) = \int_{-\infty}^{t} e^{-a(t-s)}\,dW(s).$$

If we sample $Z(t)$ to obtain the discrete-time noise process $Z_n = Z(nT)$, the covariance sequence
$R(k) = E Z_n Z_{n+k}$ is given as

$$R(k) = \frac{1}{2a}\, e^{-aT|k|},$$

which implies that $\{Z_n\}$ is the first-order autoregressive Gaussian process with parameter
$\alpha = \exp(-aT)$. Thus, the discrete-time first-order autoregressive channel arises naturally
from sampling of the continuous-time first-order autoregressive waveform channel. Is our channel
model the right one to consider? If so, how can we extend it to the general noise spectrum?

Back to our original question of the spectral efficiency, first note that the autoregressive parameter
$\alpha = \exp(-aT)$ of the above channel model changes as the sampling period $T$ changes. Moreover,
we use a waveform that is almost band-limited, but still has infinite bandwidth. Hence, it seems
quite challenging to give a proper definition of the spectral efficiency, let alone a rigorous analysis.
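
The sampling argument in item 5 can be checked numerically. The following Python sketch is an illustration only: it writes $a$ for the drift rate of the Ornstein-Uhlenbeck process and $\alpha = \exp(-aT)$ for the resulting autoregressive coefficient, and the function names (`sampled_ou_as_ar1`, `ar1_covariance`, `ou_covariance`) are hypothetical helpers, not part of the paper. It relies on the standard fact, not stated in the text, that the innovation of the sampled process has variance $(1 - e^{-2aT})/(2a)$, and compares the stationary AR(1) covariance with the closed form $R(k) = e^{-aT|k|}/(2a)$.

```python
import math
from typing import Tuple


def sampled_ou_as_ar1(a: float, T: float) -> Tuple[float, float]:
    """AR(1) description of Z_n = Z(nT) for dZ = -a Z dt + dW.

    Returns (alpha, q), where Z_{n+1} = alpha * Z_n + U_n and the
    innovations U_n are i.i.d. N(0, q). Names are illustrative only.
    """
    if a <= 0 or T <= 0:
        raise ValueError("a and T must be positive")
    alpha = math.exp(-a * T)
    # Variance of the stochastic integral of e^{-a u} dW(u) over [0, T].
    q = (1.0 - math.exp(-2.0 * a * T)) / (2.0 * a)
    return alpha, q


def ar1_covariance(alpha: float, q: float, k: int) -> float:
    """Stationary covariance R(k) of Z_{n+1} = alpha Z_n + U_n, Var(U_n) = q."""
    return q / (1.0 - alpha * alpha) * alpha ** abs(k)


def ou_covariance(a: float, T: float, k: int) -> float:
    """Closed form R(k) = exp(-a T |k|) / (2 a) for the sampled process."""
    return math.exp(-a * T * abs(k)) / (2.0 * a)


if __name__ == "__main__":
    a = 0.7  # assumed drift rate, chosen only for illustration
    for T in (1.0, 0.1, 0.01):
        alpha, q = sampled_ou_as_ar1(a, T)
        print(f"T={T}: alpha={alpha:.4f}, "
              f"AR(1) R(1)={ar1_covariance(alpha, q, 1):.6f}, "
              f"closed form R(1)={ou_covariance(a, T, 1):.6f}")
```

Running the script for decreasing $T$ shows $\alpha$ approaching 1, which is the dependence of the channel parameter on the sampling period noted above.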



Appendix 
Existence of an optimal $(S_V^\star, B^\star)$

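Suppose that the noise spectrum $S_Z(e^{i\theta})$ is lower bounded by some $\delta > 0$. We write

$$f(S_V, B) = \int_{-\pi}^{\pi} \frac{1}{2} \log \frac{S_V(e^{i\theta}) + |1 + B(e^{i\theta})|^2 S_Z(e^{i\theta})}{S_Z(e^{i\theta})}\,\frac{d\theta}{2\pi}.$$

Then $C_{\mathrm{FB}} = \sup_{S_V, B} f(S_V, B)$, where the supremum is taken over all $S_V \ge 0$ and strictly causal
polynomials $B(e^{i\theta})$ with

$$\int_{-\pi}^{\pi} \bigl( S_V(e^{i\theta}) + |B(e^{i\theta})|^2 S_Z(e^{i\theta}) \bigr)\,\frac{d\theta}{2\pi} \le P.$$

By the change of variable $S_Y = S_V + |1 + B|^2 S_Z$, we write

$$g(S_Y, B) = \int_{-\pi}^{\pi} \frac{1}{2} \log S_Y(e^{i\theta})\,\frac{d\theta}{2\pi},$$

which differs from $f(S_V, B)$ only by the term $\int_{-\pi}^{\pi} \frac{1}{2} \log S_Z(e^{i\theta})\,\frac{d\theta}{2\pi}$, which does not depend on $(S_V, B)$.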

Let $H_2(\mu_Z)$ denote the space of analytic functions square-integrable with respect to the noise spectral
distribution $d\mu_Z = S_Z(e^{i\theta})\,d\theta$. Since polynomials are dense in $H_2(\mu_Z)$, it is natural to ask whether the
maximum of $g(S_Y, B)$ is achieved by an $(S_Y^\star, B^\star)$ in

$$K = \Bigl\{ (S_Y, B) \in L_1 \times H_2(\mu_Z) : S_Y - |1 + B|^2 S_Z \ge 0,\ B(0) = 0,\ \int_{-\pi}^{\pi} \bigl( S_Y - (2B + 1) S_Z \bigr)\,\frac{d\theta}{2\pi} \le P \Bigr\}.$$

Here the last constraint comes from the facts that

$$S_V + |B|^2 S_Z = S_Y - B S_Z - \bar{B} S_Z - S_Z$$

and that $B$ and $S_Z$ have real Fourier coefficients. Note that we have $B \in H_2$ whenever $B \in H_2(\mu_Z)$,
since

$$\delta \int_{-\pi}^{\pi} |B|^2\,d\theta \le \int_{-\pi}^{\pi} |B|^2\,d\mu_Z.$$

The rest of the proof relies on functional analysis on topological vector spaces. See, for example,
Megginson [57] and Dunford and Schwartz [16] for terminology and proofs of the classical theorems we
refer to in the following discussion.

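First, we relax the constraint set $K$ by embedding the space of $S_Y$ in $L_1$ into the space $M_+$ of positive
measures on $[-\pi, \pi)$. Noting from Lemma HOI that

$$g(S_Y, B) = \inf \frac{1}{2} \log \left( \int_{-\pi}^{\pi} \Bigl| 1 - \sum_k a_k e^{ik\theta} \Bigr|^2 S_Y(e^{i\theta})\,\frac{d\theta}{2\pi} \right)$$

with the infimum over all polynomials with coefficients $\{a_k\}$, we define

$$\tilde{g}(\mu_Y, B) = \inf \frac{1}{2} \log \left( \frac{1}{2\pi} \int_{-\pi}^{\pi} \Bigl| 1 - \sum_k a_k e^{ik\theta} \Bigr|^2 d\mu_Y \right).$$

If $\mu_Y$ is decomposed into absolutely continuous and singular parts as $d\mu_Y = S_Y(e^{i\theta})\,d\theta + d\mu_{Y,s}$,
Lemma HOI shows that

$$\tilde{g}(\mu_Y, B) = g(S_Y, B),$$

independent of the singular part $\mu_{Y,s}$.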
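
As a numerical illustration of the identity used above: for a strictly causal polynomial $\sum_{k \ge 1} a_k e^{ik\theta}$, the integral $\int |1 - \sum_k a_k e^{ik\theta}|^2 S_Y \frac{d\theta}{2\pi}$ is the mean-square error of predicting a stationary process with spectrum $S_Y$ from its past, and its infimum equals $\exp\bigl(\int \log S_Y \frac{d\theta}{2\pi}\bigr)$. The Python sketch below is not from the paper. It assumes a hypothetical test spectrum $S_Y(e^{i\theta}) = |1 + c e^{i\theta}|^2$ with $c = 0.5$, computes the finite-order prediction errors by the Levinson-Durbin recursion (a tool chosen here for convenience), and compares them with the geometric mean of the spectrum obtained from a Riemann sum. All function names are illustrative.

```python
import math
from typing import Callable, List


def prediction_errors(r: List[float], order: int) -> List[float]:
    """Levinson-Durbin recursion.

    r[k] is the autocovariance at lag k. Returns the minimum mean-square
    one-step prediction errors for predictor orders 0, 1, ..., order.
    """
    err = r[0]
    errors = [err]
    coeffs: List[float] = []  # a_1, ..., a_m of the current best predictor
    for m in range(1, order + 1):
        acc = r[m] - sum(coeffs[j] * r[m - 1 - j] for j in range(m - 1))
        kappa = acc / err  # reflection coefficient
        coeffs = [coeffs[j] - kappa * coeffs[m - 2 - j] for j in range(m - 1)] + [kappa]
        err *= 1.0 - kappa * kappa
        errors.append(err)
    return errors


def geometric_mean(spectrum: Callable[[float], float], n_grid: int = 4096) -> float:
    """exp of the average of log S over a uniform grid on [-pi, pi)."""
    total = 0.0
    for j in range(n_grid):
        theta = -math.pi + 2.0 * math.pi * j / n_grid
        total += math.log(spectrum(theta))
    return math.exp(total / n_grid)


def ma1_spectrum(theta: float, c: float = 0.5) -> float:
    """Assumed test spectrum |1 + c e^{i theta}|^2 = 1 + c^2 + 2 c cos(theta)."""
    return 1.0 + c * c + 2.0 * c * math.cos(theta)


if __name__ == "__main__":
    c = 0.5
    # Autocovariances of the test spectrum: R(0) = 1 + c^2, R(1) = c, R(k) = 0 for k >= 2.
    r = [1.0 + c * c, c] + [0.0] * 30
    errs = prediction_errors(r, 30)
    print("prediction errors (orders 0, 1, 30):", errs[0], errs[1], errs[-1])
    print("geometric mean of the spectrum:", geometric_mean(lambda t: ma1_spectrum(t, c)))
```

The prediction error decreases with the predictor order and approaches the geometric mean of the spectrum, so that one half of its logarithm approaches $g(S_Y, B)$ for this test spectrum.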

Now we prove that the maximum of $\tilde{g}(\mu_Y, B)$ is attained in

$$K = \Bigl\{ (\mu_Y, B) \in M_+ \times H_2(\mu_Z) : (d\mu_Y - |1 + B|^2 d\mu_Z) \in M_+,\ B(0) = 0,\ \frac{1}{2\pi} \int_{-\pi}^{\pi} d\mu_Y - \frac{1}{2\pi} \int_{-\pi}^{\pi} (2B + 1)\,d\mu_Z \le P \Bigr\}.$$

Recall that $M_+$ is a subset of the space $M$ of signed measures and $M$ is isomorphic to the space of
linear functionals on continuous functions on $[-\pi, \pi)$, that is, $M \simeq C[-\pi, \pi)^*$. Also, $H_2(\mu_Z)$ is a Hilbert
space and the dual of itself. We will show that the constraint set $K$ is compact in the product topology
of the weak* topology on $M_+$ and the weak (= weak*, because $H_2(\mu_Z)$ is a Hilbert space) topology on $H_2(\mu_Z)$.
We then show that $\tilde{g}$ is upper semicontinuous under the same topology. This clearly implies that
the maximum of $\tilde{g}$ is attained in $K$. (That the maximum of an upper semicontinuous function is attained
on a compact domain is well known. For the proof, see, for example, Luenberger [55, Sections 2.13,
5.10].) Finally, because $\tilde{g}(\mu_Y)$ depends only on the absolutely continuous part of $\mu_Y$, if the maximum
of $\tilde{g}$ is attained by $(\mu_Y^\star, B^\star)$ in the relaxed set, there exists $(S_Y^\star, B^\star)$ in the original constraint set
that attains the same maximum; clearly, any singular part of the spectral distribution wastes the power.
The details of the proof follow.

All topological properties such as compactness, closedness, and continuity will be used with respect
to the product topology of weak* topologies on $M_+$ and $H_2(\mu_Z)$, unless noted otherwise.

For compactness, we observe that $K_1 = \{\mu_Y \in M_+ : \frac{1}{2\pi} \int_{-\pi}^{\pi} d\mu_Y \le P\}$ and
$K_2 = \{B \in H_2(\mu_Z) : \frac{1}{2\pi} \int_{-\pi}^{\pi} |B|^2\,d\mu_Z \le P\}$ are norm balls in the respective norm topologies;
thus both are weak* compact by the Alaoglu-Banach theorem, and so is $K_1 \times K_2$. Since $K \subset K_1 \times K_2$,
closedness of $K$ will guarantee its compactness. Note that $B(0) = 0$ if and only if
$\int_{-\pi}^{\pi} B(e^{i\theta})\,d\theta = 0$. Since
$T_1(B) := \int_{-\pi}^{\pi} B(e^{i\theta})\,d\theta = \int_{-\pi}^{\pi} \bigl(1/S_Z(e^{i\theta})\bigr) \cdot B(e^{i\theta})\,d\mu_Z(\theta)$
is bounded and linear, and thus weakly* continuous, $\{B(0) = 0\} = T_1^{-1}(\{0\})$ is closed. Similarly,

$$\frac{1}{2\pi} \int_{-\pi}^{\pi} d\mu_Y - \frac{1}{2\pi} \int_{-\pi}^{\pi} (2B + 1)\,d\mu_Z$$

is continuous, so the set

$$\Bigl\{ \frac{1}{2\pi} \int_{-\pi}^{\pi} d\mu_Y - \frac{1}{2\pi} \int_{-\pi}^{\pi} (2B + 1)\,d\mu_Z \le P \Bigr\}$$

is closed. Finally, $d\mu_Y - |1 + B|^2 d\mu_Z$ is a positive measure if and only if

$$T_\phi(\mu_Y, B) := \int_{-\pi}^{\pi} \phi\,d\mu_Y - \int_{-\pi}^{\pi} \phi\,|1 + B|^2\,d\mu_Z \ge 0$$

for all nonnegative $\phi \in C[-\pi, \pi)$. But for each $\phi \ge 0$, $\int_{-\pi}^{\pi} \phi\,|1 + B|^2\,d\mu_Z$ is (strongly) continuous and
convex in $B$. Therefore, it is also weakly (= weakly*) lower semicontinuous; see Ekeland and Temam
[20, Section 2.2]. This implies that $T_\phi$ is upper semicontinuous and $T_\phi^{-1}([0, \infty))$ is closed. Since the
intersection of an arbitrary collection of closed sets is closed,
$\{(\mu_Y, B) : d\mu_Y - |1 + B|^2 d\mu_Z \in M_+\} = \bigcap_\phi T_\phi^{-1}([0, \infty))$
is closed. For the same reason, $K$ is closed, and as a closed subset of a compact set, it is compact as
well.

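For weak* upper semicontinuity of $\tilde{g}(\mu_Y)$, we first fix $\mu_Y \in M_+$ and note from the definition of
weak* convergence that

$$a_n(p) := \int_{-\pi}^{\pi} |1 - p(e^{i\theta})|^2\,d\mu_{Y,n} \to \int_{-\pi}^{\pi} |1 - p(e^{i\theta})|^2\,d\mu_Y =: a(p)$$

for any fixed strictly causal polynomial $p$ and any sequence $\mu_{Y,n}$ weakly* convergent to $\mu_Y$. Hence

$$\inf_p \lim_n a_n(p) = \inf_p a(p) = \tilde{g}(\mu_Y),$$

where, with a slight abuse of notation, $\tilde{g}(\mu_Y)$ is identified with the infimum inside the logarithm
(up to the constant factor $1/2\pi$); since $t \mapsto \frac{1}{2} \log t$ is continuous and increasing, this does not
affect upper semicontinuity. Now for each $n$, we can find a strictly causal polynomial $p_n$ such that

$$a_n(p_n) \le \inf_p a_n(p) + \frac{1}{n}.$$

By taking limits on both sides, we get

$$\limsup_n \inf_p a_n(p) \le \limsup_n a_n(p_n) \le \inf_p \lim_n a_n(p).$$

In other words,

$$\limsup_n \inf_p \int_{-\pi}^{\pi} |1 - p|^2\,d\mu_{Y,n} \le \inf_p \int_{-\pi}^{\pi} |1 - p|^2\,d\mu_Y = \tilde{g}(\mu_Y)$$

for any $\mu_{Y,n}$ weakly* convergent to $\mu_Y$. Thus, $\tilde{g}(\mu_Y)$ is weakly* upper semicontinuous. This completes
the proof that the maximum in the variational characterization of the feedback capacity is attained.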

Finally, we remark that the condition that $S_Z$ is bounded away from zero is necessary. As a simple
example, if $S_Z(e^{i\theta}) = |1 + e^{i\theta}|^2$, it is shown in Section EI that the feedback capacity of this noise
spectrum corresponds to the output spectrum $S_Y$ of the form

$$S_Y(z) = \frac{1}{x^2}\,(1 + x^2 z)(1 + x^2 z^{-1}).$$
But we can easily check that there is no $(S_V, B)$ resulting in this output spectrum.
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